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INVESTIGATION  OF  MODULARLY  CONFIGURED 
ATTACHED  PROCESSORS  WITH  INTELLIGENT  MEMORIES 


1.  STATUS  OF  PROJECT 

Objective  1  (register- level  design  of  MCAP) :  Block  diagrams 
of  all  MCAP  components  have  been  completed  and  documented 
(see  Attachment  A)  by  Glenn  Gibson  and  an  undergraduate 
student.  These  diagrams  include  all  of  the  important 
registers  in  the  components.  More  detailed  designs  based  on 
these  block  diagrams  are  currently  being  done  in  order  to 
fulfill  the  objectives  below. 

Objective  2  (architecture/algorithm  case  studies) ;  The 

simulator  software  package  is  now  complete  and  this  study  is 
now  progressing  at  a  stepped  up  pace.  While  learning  the 
MCAP  system,  a  doctoral  student  under  the  direction  of 
Gibson  has  designed  an  architecture  for  performing  matrix 
operations  and  has  written  programs  for  executing  matrix 
multiplication  and  matrix  inversion.  If  implemented  in  1 
micron  CMOS,  the  architecture  should  have  a  peak  performance 
of  250  MFlops/s.  Simulations  of  matrix  multiplication  can 
be  carried  out  at  slightly  more  than  240  MFlops/s,  giving  a 
processor  efficiency  of  approximately  96%.  The  processor 
efficiency  for  matrix  inversion  is  currently  less  then  40%, 
but  the  program  is  being  rewritten  and  we  are  hoping  to 
attain  an  efficiency  of  over  60%.  Another  program  for 
performing  Gaussian  elimination  will  then  be  written.  Upon 
completion  of  this  work  an  article  will  be  submitted  to  the 
Journal  on  Computer  Simulation.  An  undergraduate  student, 
also  under  the  direction  of  Gibson,  has  designed  an 
architecture  for  performing  iterative  solutions  to  two- 
dimensional  partial  differential  equations.  Although  he  has 
completed  a  program  for  solving  LaPlace's  equation,  his 
results  are  premature.  Programs  for  other  first  and  second 
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order  equations  will  also  be  written.  A  doctoral  student 
and  two  master's  level  students  are,  under  the  supervision 
of  Sergio  Cabrera,  developing  architectures  and  writing 
programs  for  performing  FFTs  and  signal  processing 
algorithms.  This  work  began  in  January.  Although  it  took 
th2  students  some  time  to  learn  the  MCAP  design  philosophy, 
thsir  effort  is  progressing  well  now.  Other  journal 
suomissions  will  be  made  as  this  work  matures. 


Objective  3  (two  memory  controller  designs) :  Yu -Cheng  Liu  is 
directing  a  master's  level  student  and  an  undergraduate 
student  in  this  work.  It  began  in  earnest  in  November, 
after  the  block  diagrams  in  Attachment  A  were  finalized. 

The  designs  are  at  the  logic  level  and  are  being  done  using 
the  Mentor  Graphics  software  package.  The  address 
generation  and  partition  pattern  logic  for  the  single-access 
(S)  and  dual-  access  (D)  components  have  been  completed. 
Also,  a  preliminary  study  of  an  alternative  design  that  uses 
shift  registers  has  been  done.  Their  work  to  date  is  being 
submitted  to  the  '95  ISCA  Int ' 1  Conf .  on  Computer  Appl .  in 
Industry  and  Engr.  in  Honolulu. 


Objective  4  (technology  evaluations) :  This  work  involves  two 
doctoral  students  and  two  master's  level  students  and  is 
being  guided  by  Vi jay  Singh.  Singh,  Gibson,  two  doctoral 
students  and  one  master's  level  student  attended  the 
multichip  module  conference,  MCMC  '95,  in  Santa  Cruz, 

California,  in  January  and  presented  a  paper  (see  Attachment 
C) .  The  presentation  was  made  by  one  of  the  doctoral 
students.  Buck  Gremel .  Also,  Singh's  group  has  produced  a 
paper  that  has  been  accepted  by  the  Int'l  Journal  of 
Electronics  (see  Attachment  D)  and  has  submitted  a  paper  to 
the  IEEE  J.  on  Components,  Packaging  and  Manufacturing  Tech. 

To  date,  the  group  has  concentrated  its  efforts  on  the  MCM 
CMOS  design  of  an  architecture  for  performing  matrix 
operations,  but  one  of  the  master's  level  students  has  been 

investigating  GaAs  DCFL  technology.  Preliminary  work  on  ...  _ 

Wafer  Scale  Integration  (WSI)  has  also  been  done  by  Yi-  'op _ 2'-'' 

Chieh  Chang  and  a  graduate  student  under  the  supervision  of 

Singh.  They  will  study  the  WSI  implementation  of  MCAPs  this  □  | 

summer. 


Objective  5  (simulator  development)  :  Although  the  simulator  * 

software  package,  SIMARC,  is  complete  and  currently  in  use, 

several  enhancements  are  being  made.  These  enhancements  are  r'T’l  " 

■  i.-oas® 


being  done  by  a  master's  level  student  and  an  undergraduate 
student  under  the  direction  of  Gibson.  They  primarily 
concern  the  display  of  simulation  results  and  the  ability  to 
dynamically  change  an  architecture's  attributes  while  a 
simulation  is  being  executed.  It  has  been  found  that  the 
principal  difficulty  in  using  the  SIMARC  package  is  in 
writing  programs  for  the  algorithms.  The  master's  level 
student  has  also  begun  working  on  a  graphically  assisted 
assembler  designed  to  alleviate  this  problem.  The 
instruction  set  for  MCAPs  has  been  finalized  and 
documentation  of  the  SIMARC  software  began  in  December  (see 
Attachment  B)  and  will  continue  through  next  summer.  A 
paper  on  this  package  will  be  presented  at  the  '95 
Simulation  Mult iConf erence  in  Phoenix  in  April  (see 
Attachment  E) . 

2 .  CURRENT  LEVEL  OF  EFFORT 

Although  no  theses  were  completed  during  this  six-month 
period,  the  level  of  effort  has  increased  dramatically.  In 
addition  to  the  paper  presented  in  Santa  Cruz,  the  four 
papers  listed  in  the  October  1,  1994,  report  were  presented 
in  Taiwan  and  San  Francisco.  At  present,  there  are  four 
doctoral  stu-dents,  eight  master's  level  students  and  three 
undergraduates  involved  in  the  project.  Of  these,  four  of 
the  graduate  students  are  not  currently  supported  (two  are 
part-time  students)  and  two  undergraduates  are  being 
supported  by  stipends  provided  by  the  University.  All 
others  are  supported  by  stipends  and  research  assistantships 
through  this  grant . 

3.  PAPERS  ACCEPTED  FOR  PUBLICATION  (see  attachments) 

J.  Singh,  B.Gremel,  V.  Singh  and  G.  Gibson,  "Design  Consid¬ 
erations  for  implementing  a  Modularly  Configured  Attached 
Processor  in  a  Multi-Chip  Module,"  Proc .  of  MCMC  '95,  Santa 
Clara,  CA,  Jan.,  1995,  pp.  62-65. 

G.  Gibson,  A.  Brito,  Y.  Chang,  D.  Saenz  and  E.  Castro, 
"Simu-lation  and  Fast  Prototyping  of  Modularly  Configurable 
Attached  Processors,"  Proc.  of  1995  Multi  conference  on  High 
Performance  Computing,  Phoenix,  AZ,  April,  1995. 

J.  Singh,  B.  Gremel,  V.  Singh,  and  G.  Gibson,  "Design  Issues 
in  a  CMOS  Implementation  of  a  Modularly  Configured  Attached 


Processor",  accepted  by  Int'l  J.  of  Electronics . 

PAPERS  SUBMITTED  FOR  PUBLICATION 

J.  Singh,  S.  Nagabathula,  V.  Singh  and  G.  Gibson, 
"Comparative  Evaluation  of  MCM  and  WSI  Schemes  for 
Implementing  a  Modularly  Configured  Attached  Processor 
Architecture,"  submitted  to  IEEE  Trans,  on  Components, 
Packaging  and  Manufacturing  Tech. --Part  B:  Adv.  Packaging. 

Y.  C.  Liu,  G.  Gibson  and  S.  Vaishampayan,  "Intelligent 
Memory  Controllers  for  Modularly  Configured  Attached 
Processors,"  submitted  to  '95  Int'l  Conf .  on  Computer  Appl . 
in  Industry  and  Engr.  in  Honolulu,  Hawaii. 


CHAPTER  1 

MCAP  DEFINITION 


An  MCAP  is  an  attached  processor  that  is  constructed  entirely  from  a  standard  set  of 
connections  and  components.  This  set  consists  of  two  types  of  connections  and  ten  types  of 
components.  The  definitions  of  the  connection  and  component  types  provide  a  standard  set  of 
rules  that  allow  the  components  to  be  e<isily  configured  in  different  ways  to  construct  attached 
processors  that  efficiently  perform  different  sets  of  algorithms. 

An  MCAP  is  connected  to  its  host  using  separate  instruction  and  data  streams.  There  is  one 
instruction  stream  and  it  flows  from  the  hosts  memory  to  the  MCAP’s  only  instruction  com¬ 
ponent.  The  instruction  stream  is  depicted  in  Fig. 1.1.  It  is  assumed  that,  for  each  algorithm 
the  MCAP  is  to  execute,  the  instructions  for  the  algorithm  have  been  permanently  stored  as  a 
subroutine  in  a  ROM  in  the  instruction  component.  The  instructions  sent  from  the  host  to  the 
instruction  component’s  R.4M  designate  which  algorithm  is  to  be  executed  and  the  parameters 
needed  by  the  algorithm(e.g.,  sizes  and  main  memory  locations  of  matrices).  An  algorithm  is 
executed  by  drawing  from  the  set  of  instructions  received  from  both  the  host  and  subroutine. 
Using  these  instructions  and  the  instruction  component’s  register  set  and  interna]  logic,  an  in¬ 
struction  stream,  called  the  stream  of  external  instructions,  is  produced  and  directed  to  the  other 
component’s  in  the  MC.AP.  The  stream  of  external  instructions  provides  the  other  components 
with  the  information  needed  to  perform  the  algorithm.  Each  external  instructon  is  directed  to 
a  component  by  putting  the  component’s  identifying  address  (i.e.,  component  number)  on  the 
address  bus.  The  receiving  component  causes  the  contents  of  the  instructon  to  be  distributed  to 
its  appropriate  registers.  It  is  the  contents  of  the  component’s  registers  that  dictate  its  actions 
during  the  execution  of  an  algorithm.  For  the  given  algorithm,  the  set  of  instructions  that  a 
component  receives  must  be  sufficient  to  fill  the  registers  needed  by  the  algorithm. 

There  may  be  multiple  data  streams  between  the  host  and  MCAP  and  they  are  connections 
to  one  or  more  of  the  MC.\P's  memory  subsystem  components.  The  primary  purpose  of  these 
components  is  to  buffer  data  between  the  host  and  the  MCAP,  although  they  may  also  rearrange 
the  data  being  input  from  or  output  to  the  host.  Once  the  data  has  been  put  into  the  memory 
subsystem  components,  other  MCAP  components  may  be  used  to  process  the  data  and  return 


results  to  the  memory  system  components,  from  which  the  results  may  be  output  to  the  host’s 
main  memory.  All  movement  and  processing  of  data  to  and  from  the  host  and  within  the  MCAP 
is  determined  by  the  external  instructions  sent  from  the  MCAP’s  instruction  component  to  other 
components  in  the  MCAP.  These  instructions  set  up  memory-to-memory  pipelines  that  route 
and  process  the  data  according  to  the  algorithm  to  be  performed. 

All  MCAP  connections  are  unidirectional  and  asynchronous.  There  are  tw'o  types  of  con¬ 
nections  within  an  MC.4P,  instruction  connections  and  data  connections.  A  data  connection 
consists  of  a  data  bus  and  a  Request/Acknowledge  (Req/Ack)  pair.  A  transfer  begins  when  the 
transmitting  component  puts  the  data  on  the  data  bus  and  activates  the  Req  line.  When  the 
receiving  component  recieves  the  Req  signal  and  accepts  the  data,  it  pulses  the  Ack  line.  The 
transfer  is  complete  and  the  data  and  Req  signals  are  dropped  when  the  Ack  pulse  is  received 
by  the  transmitting  component.  An  instruction  connection  is  similar  except  that  it  may  be 
connected  in  more  than  one  receiving  component  and  includes  an  address  bus.  An  address  is 
put  onto  this  bus  at  the  same  time  an  instruction  is  put  onto  the  data  bus.  Only  the  addressed 
component  can  receive  the  data  and  return  the  .4ck  pulse. 

The  ten  types  of  components  fall  into  four  categories  and  are  summarized  as  follows; 
Instruction  (1) 

Processor; 

Elementary-one  input,  one  output(E) 

Two-input-two  inputs,  one  output  (T) 

Comparator-two  inputs,  one  output  plus  special  outputs  (C) 

Router; 

Join-multiple  inputs,  one  output  (J) 

Fork-one  input,  multiple  outputs  (F) 

Link-multiple  inputs,  multiple  outputs  (L) 

.Memory  subsystem: 

RAM  -one  input,  one  output,  no  partitions  (R) 

Single-access-one  input,  one  output,  has  partitions  (S) 
Dual-access-two  inputs,  two  outputs,  has  partitions  (D) 
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The  letter  used  to  indicate  each  type  of  component  is  given  in  par  entheses.  These  component 
types  are  described  in  the  subsections  below. 

1.1  Instruction  Components 

An  MCAP  contains  exactly  one  instruction  (I)  component.  A  block  diagram  of  an  I  compo¬ 
nent  is  given  in  Fig.  1.2  .  As  explained  above,  an  I  component  receives  instructions  from  the 
host  and  produces  a  stream  of  external  instructions  that  are  sent  to  the  other  components  in  the 
MCAP,  The  instructions  sent  from  the  host  are  put  into  the  I  components  RAM.  These  instruc¬ 
tions  are  then  brought  in  through  the  input  instruction  queue  and  decoded.  They  include  the 
parameter  values  needed  by  the  algorithm  to  be  executed  and  cause  these  values  to  be  put  into 
the  register  set.  The  last  instruction  received  from  the  host  causes  the  subroutine  that  executes 
the  algorithm  to  be  initiated.  The  instructions  in  the  subroutine  are  then  brought  in  through  the 
input  instruction  queue.  There  are  two  types  of  instructions,  internal  instructions  and  external 
instructions.  The  internal  instructions,  the  parameter  values  in  the  register  set  and  the  internal 
logic  of  the  I  component  are  used  to  produce  other  parameter  values  that  are  also  stored  in  the 
register  set.  External  instructions  are  those  that  are,  perhaps,  modified  by  the  parameters  in 
the  register  set  and  sent  out  to  the  other  MCAP  components  via  the  output  instruction  queue 
and  instruction  connection.  Internal  instructions  are  capable  of  performing  integer  arithmetic 
and  logical  operations,  subroutine  calls  and  returns,  looping,  unconditional  branches,  and  con¬ 
ditional  branches  based  on  flag  signals  received  from  the  processing  components.  One  of  the 
processing  components  must  be  a  comparator.  As  discussed  below,  a  comparator  is  capable  of 
determining  a  maximum  or  minimum  of  a  sequence  of  numbers  as  w^ell  as  comparing  tw'o  values. 
Therefore,  not  only  are  flag  signals  sent  from  the  comparator  that  indicate  >,<,>,  <.  =  and 
7^,  but  the  comparator  may  also  return  the  index  within  a  sequence  of  a  maximum  or  minimum 
values.  Such  indices  may  be  used  to  modify  external  instructions,  particularly  the  instructions 
for  computing  addresses. 

In  addition,  the  I  component  receives  flags  from  the  other  processing  components  that  indi¬ 
cate  exceptional  conditions  (e.g.,  division  by  zero,  overflows,  underflows  and  so  on).  These  flags 
are  used  to  interrupt  the  host  via  the  control  bus  when  the  MCAP  aborts  an  algorithm. 
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1.2  Processor  Components 


The  processor  components  are  for  performing  unary  and  binary  arithemtic/logic  operations. 
There  are  three  types  of  processor  components.  There  are  one-input  elementary  (E)  components, 
two-input  (T)  components  ,  and  comparator  (C)  components.  Figure  1.3  gives  a  block  diagram, 
programmable  register  summary,  and  mode  register  definition  for  an  E  component.  As  seen 
from  the  block  diagram  an  E  component  has  an  input  instruction  connection,  an  input  data 
connection,  an  output  data  connection,  and  a  set  of  flag  lines.  It  consists  of  an  input  queue 
at  Its  input  connection,  process  logic  for  performing  the  required  operation,  and  control  logic 
that  contains  four  programmable  registers.  The  mode  register  dictates  the  actions  taken  by 
the  component  and  it,  along  with  the  other  three  registers,  determine  the  order  and  manner  in 
which  the  data  inputs  are  used  and  when  the  component  must  input  more  instructions. 

.A  list  of  the  currently  available  modes  for  an  E  component  are  given  in  Table  1.1.  Eight  bits 
in  an  E  component’s  Mode  register  are  used.  Bit  0  indicates  whether  are  not  the  component 
is  being  used  to  perform  an  accumulation  (e.g.,  sum  a  column  of  numbers).  This  bit  is  needed 
onl\  if  the  E  component  is  part  of  a  pipeline  that  can  perform  accumulations.  When  this  bit  is 
1  the  number  in  the  NumOpsOut  register  is  automatically  modified  according  to  the  number  of 
stages  in  the  pipeline,  thereby  taking  into  account  the  final  accumulation  steps.  Bits  1  through 
4  indicate  the  components  current  function  with  bit  4  specifying  whether  an  unary  or  a  binary 
operation  is  to  be  conducted.  If  an  E  component  is  used  for  a  binary  operation,  the  first 
operand  is  transferred  to  a  latch  and  the  operation  proceeds  wTen  the  second  operand  arrives. 
It  is  possible  for  an  E  component  to  output  a  constant  or  perform  a  binary  operation  with  one 
of  the  operands  being  held  constant.  Whether  a  constant  is  being  used  is  determined  by  bit  5. 
If  a  constant  is  being  is  used  it  is  possible  to  fill  the  latch  by  using  an  immediate  instruction  or 
by  using  the  first  datum  that  arrives  at  the  input  data  queue.  Bit  6  is  used  to  indicate  one  of 
these  two  choices.  Bit  7.  8  and  9  are  not  used.  Bit  10  allows  an  E  component  to  be  put  into 
primitive  mode.  In  primitive  mode,  an  E  component  does  not  accept  additional  instructions, 
but  is  simply  a  passive  component  that  inputs  and  processes  data  as  it  becomes  available. 

Except  for  an  E  component  in  primitive  mode  or  an  I  component,  from  the  standpoint  of 
any  component  an  algorithm  is  broken  into  tasks  and.  for  each  task,  the  component  must  receive 
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a  sequence  of  instructions  that  determines  what  it  is  to  do  to  execute  the  teisk.  When  the  com¬ 
ponent  has  completed  a  task,  it  must  input  another  sequence  of  instructions.  Instructions  are 
first  used  to  fill  the  Mode,  Number  of  Repetitions  (NumRepetitions)  and  Decrement  Amount 
(DecAmt)  registers.  (If  one  or  more  of  these  registers  is  not  filled,  it's  current  contents  are 
used).  Then,  when  an  instruction  fills  the  Number  of  Operands  Out  (NumOpsOut)  register,  the 
component  begins  executing  the  task.  The  Number  of  Operands  Out  Constant  (NumOpsOut- 
Const)  register  is  automatically  loaded  from  the  NumOpsOut  register.  As  the  task  executes, 
it  draws  inputs  from  it’s  input  data  queue(s),  processes  them,  and  outputs  any  results.  Each 
time  there  is  an  output,  the  NumOpsOut  register  is  decremented.  Each  time  the  NumOpsOut 
register  becomes  0  the  NumOpsOutConst  register  is  decremented  by  DecAmt  and  used  to  reload 
the  NumOpsOut  register.  Also,  the  NumRepetitions  register  is  decremented  by  1.  When  both 
NumOpsOut  and  NumRepetitions  become  0  the  task  is  complete  and  the  component  must  input 
another  input  sequence  of  instructions. 

Figure  1.4  gives  the  block  diagram,  programmable  register  summary,  and  mode  register 
definition  for  a  T  component.  It  is  similar  to  an  E  component,  but  the  T  component  has  two 
input  data  connections  and  corresponding  queues.  A  list  of  the  currently  available  modes  for 
a  T  component  are  given  in  Table  1.2.  Ten  bits  in  a  T  component’s  Mode  register  are  used, 
with  the  first  seven  serving  the  same  purposes  cis  in  an  E  component.  However,  if  a  constant 
is  input  through  one  of  the  input  data  conections,  the  connection  it  arrives  on  is  indicated  by 
bit  7.  Bit  8  specifies  whether  one  or  both  inputs  are  to  be  used,  bit  9  specifies  which  input  any 
nonconstant  input  value  will  arrive  on. 

Figure  1.5  gives  the  block  diagram,  programmable  register  summary  and  mode  register 
definition  for  a  C  component.  A  C  component  is  a  T  component  that  has  two  special  sets  of 
lines  connecting  it  to  the  I  component.  There  can  be  only  one  C  component  in  an  MCAP.  As 
usual,  its  current  function  is  determined  by  its  mode.  A  list  of  the  currently  available  modes 
for  a  C  component  are  given  in  Table  1.3.  One  of  its  functions  is  to  simply  compare  two  inputs 
and  set  relational  flags  that  are  then  transmitted  to  the  I  component  over  one  set  of  the  special 
lines.  When  performing  comparisions  there  are  no  outputs  other  than  the  flag  outputs.  The  C 
component  can,  however,  also  determine  the  maximum  or  minimum  of  a  sequence  of  numbers. 
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In  this  case,  the  second  set  of  special  lines  is  used  to  transmit  the  position,  or  index,  of  the 
maximum  or  minimum  within  the  sequence  to  the  I  component.  If  the  maximum  or  minimum 
occurs  more  than  once  in  the  sequence,  the  index  always  points  to  the  first  occurence.  If  a 
maximum  or  minimum  is  being  determined,  then  NumOpsOut  is  used  to  specify  the  length  of 
the  input  sequence  (i.e.,  NumOpsOut  really  indicates  the  number  of  operands  input).  Also,  the 
maximum  or  minimum  is  output  on  the  output  data  connection  at  the  same  time  its  index  is 
output  on  the  index  lines. 

1.3  Routing  Components 

Routing  components  are  for  directing  data  along  the  proper  paths.  There  are  three  types  of 
routing  components,  join  (J)  components  with  more  than  one  input  and  one  output,  fork  (F) 
components  with  one  input  and  more  than  one  output,  and  link  (L)  components  with  more  than 
one  input  and  more  than  one  output. 

Figure  1.6  shows  the  block  diagram,  programmable  register  summary,  and  mode  register 
definition  for  a  J  component.  It  has  an  input  instruction  connection,  multiple  input  data  con¬ 
nections  and  an  output  data  connection.  It  consists  of  a  queue  for  each  input,  a  bus,  control 
logic  containing  the  same  individual  programmable  registers  as  a  processing  component  and  a 
programmable  set  of  input  pattern  registers  (InPattern).  The  registers  in  a  pattern  are  summa¬ 
rized  in  Fig.  1.7.  The  .NumRepetitions,  Dec.\mt.  NumOpsOut  and  NumOpsOutConst  registers 
are  used  as  they  are  in  the  processor  components.  The  sole  purpose  of  the  Mode  register  is 
to  distinguish  between  no  accumulation  and  accumulation.  As  with  processor  components.  J 
and  F  components  may  be  part  of  a  pipeline  capable  of  performing  accumulation.  .4  processor 
component  pipeline  that  performs  accumulations  always  has  a  J  component  at  one  of  its  input 
data  connections,  an  F  component  at  its  output  data  connection,  and  a  feedback  data  connec¬ 
tion  from  the  F  component  to  the  J  component.  These  F  and  J  components  must  increase 
NumOpsOut  by  an  amount  that  depends  on  the  number  of  stages  in  the  pipeline.  Also,  the 
F  component,  which  normally  outputs  to  the  feedback  connection  must  output  the  final  result 
to  a  different  output  connection.  This  allow’s  the  pipeline  to  accumulate  partial  results  and 
then  produce  a  final  result  and  send  the  final  result  to  its  destination.  In  the  no  accumulation 
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mode,  InPattern  specifies  the  order  in  which  the  input  connections  are  to  be  selected  during  the 
execution  of  the  algorithm  being  programmed.  This  pattern  is  continually  cycled  through  until 
NumOpsOut  becomes  0.  It  may  be  reused  by  the  next  task  or  changed  by  a  new  instruction 
that  resets  InPattern. 

A  pattern  includes  two  subcycles.  The  Pattern  Array  (PArray)  set  of  registers  shown  in 
Fig.  1.7  are  set  to  a  sequence  of  numbers  that  indicate  input  connections.  The  register  Numberl 
indicates  the  number  of  connections  selected  during  the  first  subcycle  and  the  register  Count  1 
indicates  the  number  of  registers  from  PArray  that  are  to  be  included  in  the  first  subcycle. 
.Number2  and  Count2  serve  the  same  purpose  for  the  second  subcycle.  When  the  first  subcycle 
is  exhausted,  the  second  subcycle  is  begun,  and  when  the  second  subcycle  is  exhausted,  a  return 
is  made  to  the  first  subcycle  and  so  on. 

For  example,  if  Numberl=4,  Countl=3,  Number2=5,  Count2=2  and  PArray  contains  2,  6, 
4,  7  and  1,  then  the  input  connections  are  used  in  the  order 

2  64  2  71717  264  2  71717  .... 

The  sequence  continues  until  NumOpsOut  becomes  0. 

Figures  1.8  and  1.9  correspond  to  the  F  and  L  components,  respectively.  .A.n  F  component, 
because  it  has  only  one  input  and  multiple  outputs,  has  a  set  of  output  pattern  registers, 
OutPattern.  instead  of  an  input  set.  and  an  L  component  has  both  an  InPattern  and  OutPattern. 
Both  InPattern  and  OutPattern  are  continually  cycled  through  and  determine  the  order  in  which 
the  input  or  output  connections  are  used.  An  L  component  cannot  be  part  of  a  pipebne  and 
does  not  need  a  mode  register. 

In  addition  both  F  and  L  components  contain  a  set  of  broadcast  registers  denoted  BcPattern. 
If  an  entry  in  an  output  pattern  is  all  ones,  then  the  entry  will  not  be  used  as  an  output 
connection  number,  but  there  will  be  a  simultaneous  output  to  all  output  connections  listed  in 
BcPattern. 
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1.4  Memory  Subsystem 


There  are  three  types  of  memory  subsystem  comoponents,  RAM  (R)  components,  single 
access  (S)  components,  and  dual-access  (D)  components.  .All  memory  subsystem  components 
are  fcr  automatically  retrieving  operands  from  and  storing  results  in  their  associated  memory 
modules.  All  memory  subsystem  components  have  an  output  data  connection  and  an  input 
data  connection.  Therefore,  they  must  be  capable  of  handling  both  an  output  data  stream  and 
an  input  data  stream.  In  addition,  a  D  component  includes  a  second  pair  of  input  and  output 
connections.  .All  memory  subsystem  components  have  a  queue  in  each  of  their  input  and  output 
data  streams. 

A  significant  difference  between  the  memory  subsystem  components  and  the  other  compo¬ 
nents  is  that  a  Number  of  Operands  In  (NumOpsIn)  register  as  well  as  a  NumOpsOut  register 
must  be  included.  The  NumOpsIn  register  serves  the  same  purpose  for  the  input  data  stream 
as  NumOpsOut  does  for  the  output  stream.  Both  NumOpsIn  and  NumOpsOut  must  be  zero 
before  new  instructions  can  be  distributed  to  the  component's  programmable  registers. 

Figure  1.10  gives  the  block  diagram,  programmable  register  summary,  and  mode  register 
definition  for  the  R  component.  .An  R  component  is  primarily  used  for  temporary  storage  or  as 
a  large  queue.  An  R  component  has  six  modes  related  to  the  input  and  output  of  data.  They 
are: 

Data  is  only  stored  in  memory  (i.e..  only  the  input  data  stream  is  used). 

Output:  Data  is  only  retrieved  from  memory  (i.e.,  only  the  output  data  stream  is 

used). 

Input/Output:  Data  is  first  input  to  memory  and  then  output  from  memory.  The  input 

must  stay  ahead  of  the  output. 

Output/Input:  Data  is  first  output  from  memory  and  then  input  to  memory.  The  output 

must  stay  ahead  of  the  input. 

Input  and  Output:  Data  is  input  and  output  with  no  regard  as  to  which  is  done  first. 

Zero:  Put  zeros  in  all  memory  locations. 

Figure  1.11  gives  a  block  diagram  of  an  S  component.  .An  S  component  differs  from  an  R 
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component  in  that  it  may  be  connected  to  more  than  one  memory  module  and  the  memory  as 
a  whole  may  be  divided  into  partitions,  called  S  partitions,  that  consist  of  blocks  of  memory 
having  consecutive  addresses.  The  memory  modules  may  be  banked  and/or  interleaved  (i.e., 
the  high-order  address  bits  specify  the  bank  and  the  low-order  bits  specify  the  module  within 
the  bank).  The  partitions,  because  they  occupy  consecutive  addresses,  are  spread  across  the 
modules  and  may  even  encompass  more  than  one  bank. 

Because  some  S  partitions  are  used  for  outputting  from  memory(i.e.,  providing  the  output 
data  stream)  and  some  are  for  inputting  to  memory  (i.e.,  terminating  the  input  data  stream), 
there  is  a  set  of  programmable  registers  referred  to  as  the  Output  Partition  Pattern,  OutPartPat. 
that  determines  the  pattern  in  which  the  partitions  providing  the  output  stream  are  accessed. 
Likewise,  the  Input  Partition  Pattern,  InPartPat,  determines  the  order  in  which  the  partitions 
providing  the  input  stream  are  accessed.  In  both  cases  the  patterns  determine  the  partition 
sequence  in  the  same  way  the  connection  sequences  are  determined  by  the  routing  components 
(see  Fig. 1.7). 

Each  S  partition  is  accessed  as  a  circular  memory  (i.e.,  the  first  location  in  the  partition 
is  considered  to  follow  the  last  location  in  the  partiton).  A  summary  of  the  registers  that 
define  a  partition  and  are  used  to  determine  the  order  in  which  the  locations  within  a  partition 
are  accessed  is  given  in  Fig.1.12.  A  partition  is  defined  by  its  Base  register,  that  gives  its  base 
address,  and  Size  register,  that  designates  its  size.  From  Fig  1.12(b)  it  is  seen  that  each  partition 
has  a  mode  and  can  be  put  into  any  one  of  the  first  four  modes  permitted  an  R  component. 
If  the  mode  of  a  partition  allows  output,  a  window  must  be  defined  within  the  partition.  Ail 
outputs  must  be  from  within  the  window  and  if  there  are  inputs,  they  must  be  to  locations 
outside  the  window.  An  exception  occurs  when  a  partition  is  in  its  input  before  output  mode. 
In  this  case  the  window  must  be  filled  before  output  begins,  but  thereafter  the  input  must  be  to 
outside  the  window.  The  initial  base  address  of  the  window  is  the  same  as  that  of  the  partition. 
The  window's  base  is  incremented  with  each  repetition  of  a  window  pattern,  which  is  described 
below.  The  size  of  the  window  is  defined  by  the  contents  of  the  WinSize  register. 

Within  an  S  partition  the  sequence  of  input  addresses  is  generated  by 
1=0 
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While  NumOpsIn  >  0  { 

Input  address  =  B  +  I  mod  S 
Increment  I  by  I 
} 

where  B  and  S  are  the  base  and  size  of  the  partition. 

The  addresses  used  for  outputting  are  generated  according  to  the  output  pattern  defined  by 

the  Patinc,  NumRepsl,  Replncl,  NumReps2,  and  Replnc2  registers  and  the  offset  pattern.  Let 

P,  Ai,  Pi,  A2  and  P2  be  the  contents  of  these  registers  respectively,  and  OS  be  the  PArray  in 

the  offset  pattern  (see  Fig.  1.7.).  Also,  let  Nq  equal  the  initial  NumOpsOut  and  N3  equal  the 

sum  of  Numberl  and  Number2  from  the  offset  pattern.  Assuming  A^i  >  0  and  N2  >  0  then  the 

sequence  of  output  addresses  is  generated  by 

pQ  — Pi  —  p2  ~  —  A/2  =  A/3  =  0 

While  Mo  <  No{ 

P'l  =  0 

A/i  =  0 

While  A/q  <  Aq  and  M\  <  A’i{ 

P  2  =  0 
M2  =  0 

W  hile  A/q  <  Aq  and  A/i  <  A’l  and  M2  <  A2{ 

Ms  =  0 

While  Mo  <  Nq  and  A/i  <  Ay  and  M2  <  Ay  and  M3  <  A'sf 
Select  I  from  offset  pattern 

Output  address  =  P  +  (  A'o  +  P'l  +  P'2  +  OS[/]  )  mod  S 
Increment  A/q.  Mi,  M2  and  M3 
} 

IncrementK2  by  R2 

} 

Incrementh'i  by  Pj 

} 

IncrementKo  by  P 

} 

For  e.xample.  suppose  that  the  offset  pattern  in 

Numberl  =4  Countl=3  Number2=Count2=0  PArray=  {0,3,1} 

and 

B=0  S  =  12  A'o  =25  P=1  A'i=20  P,=2  A'2=7  P2=4. 

Then  A  3=4  and  the  sequence  of  addresses  generated  would  be 
031047  5  2  5  32697  47  5  48  11  1421  5 
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External  instructions  must,  of  course,  be  sent  to  an  S  coir  ponent  to  specify  the  mode, 
define  the  S  partitions  and  specify  the  input  and  output  partition  patterns.  The  modes  for  the 
partitions  are  determined  by  the  mode  of  the  S  component.  For  each  partition  that  produces 

output,  there  must  be  external  instructions  for  specifying  the  window  size  and  overall  output 
pattern. 

The  format  of  an  S  component  mode  instruction  is  given  in  Fig.  1.13.  Bits  0  through  29  are 
divided  into  pairs  with  each  pair  specifying  the  mode  of  an  S  partition.  Bits  0  and  1  specify 
the  mode  for  partition  0  and  so  on.  When  a  mode  instruction  is  recognized,  its  lower  30  bits 
are  seperated  into  pairs  and  the  pairs  are  sent  to  the  corresponding  S  partition  mode  registers. 
Bit  30  is  put  in  Bit  0  of  the  S  component’s  mode  register.  The  format  of  this  register  is  given 
in  Fig.  1.14.  Bit  30  is  used  to  indicate  whether  only  one  data  stream  is  to  be  used  or  both  the 
input  and  output  streams  are  to  be  used. 

A  block  diagram  of  a  D  component  is  given  in  Fig.  1.15.  In  a  D  component  there  are  two 
input  streams  and  two  output  streams.  The  input  stream  logic  in  a  D  component  is  the  same  as 
in  a  S  component,  but  it  is  replicated.  Also,  the  output  stream  logic  is  the  same,  but  replicated. 
The  S  partition  logic  is  the  same  as  in  an  S  component. 

The  formats  of  the  two  D  component  mode  instructions  are  shown  in  Fig.  1.16.  The  one 
shown  in  Fig.  1.16(a)  provides  pairs  of  bits  for  specifying  the  S  partition  modes  for  partitions  0 
through  12.  These  pairs  are  in  Bits  0  through  25.  Bit  26  indicates  that  the  attached  processor 
output  (.AP  out)  stream  is  to  be  used  when  it  is  1  and  not  to  be  used  when  it  is  0.  Similarly 
Bits  2  / ,  28  and  29  indicate  the  use  of  the  attached  processor  input  (AP  in)  stream,  host  output 
(Host  out)  stream  and  host  input  (Host  in)  stream.  The  instruction  in  Fig.  1.16(b)  gives  the  S 
partition  mode  pairs  for  partitions  13  through  27.  Bits  26,  27,  28,  and  29  are  put  in  the  four 
low-order  bits  of  the  D  component’s  mode  register.  As  with  an  S  component,  a  mode  instruction 
distributes  the  mode  bits  to  the  corresponding  S  partition  mode  registers.  The  format  of  this 
register  is  given  in  Fig. 1.17. 
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1.5  Example  Architechture 


An  MCAP  for  performing  matrix  operations  is  given  in  Fig.  1.18.  The  letter  in  each 
component  gives  the  component’s  type.  The  MCAP  includes  a  comparator  (a  C  component) 
a  negator/reciprocator  (an  E  component),  three  four-stage  adder/subtractors  (a  T  component 
followed  by  three  E  components),  three  four-stage  multipliers  (a  T  component  followed  by  three  E 
components),  several  J,  F  and  L  components  for  routing  the  data  and  three  memory  subsystem 
components.  The  small  rectangles  inside  the  memory  subsystem  components  represent  data 
streams,  address  generators,  memory  buses  and  memory  modules.  The  S  component  at  the  top 
is  for  temporarily  storing  data.  The  D  component  acts  as  a  buffer,  but  is  also  used  to  rearrange 
and  temporarily  store  data.  The  S  component  at  the  bottom  is  the  MCAP's  controller  of  the 
host’s  main  memory  and  determines  how  the  MCAP  accesses  main  memory.  The  I  component, 
instruction  connections  and  host  s  connection  to  main  memorv  are  not  shown. 
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Tab  e  1.1:  Summary  of  E  Component  Modes 


Mode 


Description 


OxxxllOfffO 

OxxxOlOfffO 

OxxxOOOfffO 

OxxxlllfFfO 

OxxxOllfffO 

OxxxOOlfffO 


IxxxOOOfFfa 


Immediate  constant  is  output  NumOpsOut  times. 

Constant  is  input  and  then  output  NumOpsOut  times. 

Unary  operations  are  performed  on  NumOps  Out  inputs  and  NumOpsOut 
results  are  output. 

Binary  operations  are  performed  using  an  immediate  constant  with  NumOp¬ 
sOut  inputs  and  NumOpsOut  results  are  output. 

Constant  is  input  and  then  binary  operations  are  performed  using  this  input 
with  NumOpsOut  additional  inputs  to  produce  NumOpsOut  outputs. 

Binary  operations  are  performed  on  NumOpsOut  pairs  of  successive  inputs 
and  NumOpsOut  results  are  output. 

Component  is  put  into  primitive  mode.  For  each  input,  a  unary  operation 
is  performed  and  a  result  is  output. 


X  -  not  used 

f  -  function  code  bit  assigned  by  designer. 

a  -  if  component  is  part  of  an  accumulation  pipeline  it  is  1:  otherwise,  it  is  0. 


Table  1.2:  Summa:'y  of  T  Component  Modes 


Mode 


Description 


OlOllOfffO 

OOcOlOiffO 

vOOOOOfFfO 

vOOlllfffO 


vOcOllfffO 


vlcOllfffO 

vOOOOlfffO 

OlOOOlfFfO 


OlOOOlfffl 


Immediate  constant  is  output  NumOpsOut  times. 

Constant  is  input  and  t.ien  output  NumOpsOut  times. 

Unary  operations  are  performed  on  NumOpsOut  inputs  and  NumOpsOut 
results  are  output. 

Binary  operations  are  performed  using  an  immediate  constant  with  NumOp¬ 
sOut  inputs  and  NumOpsOut  results  are  output. 

Constant  is  input  and  then  binary  operations  are  performed  using  this  input 
with  NumOpsOut  additional  inputs  to  produce  NumOpsOut  outputs.  Only 
one  input  connection  is  used  and  v  and  c  are  equal. 

Same  as  previous  mode  except  that  two  inputs  connections  are  used,  one  for 
the  constant  and  one  for  the  variables,  and  v  and  c  are  complements. 

Binary  operations  are  performed  on  NumOpsOut  pairs  of  successive  inputs 
arriving  on  a  single  input  connection  and  NumOpsOut  results  are  output. 

Binary  operations  are  performed  on  NumOpsOut  pairs  of  successive  inputs 
and  NumOpsOut  results  are  output.  Both  input  connections  are  used  and 
for  each  operation  one  operand  must  arrive  on  one  connection  and  the  other 
operand  must  arrive  on  the  other  connection. 

Component  is  to  be  used  as  a  part  of  an  accumulaltion  pipeline.  Both  inputs 
are  used  as  required  by  the  pipeline. 


f  -  function  code  bit  assigned  by  designer, 

V  -  number  of  the  input  connection  to  be  used  to  input  the  variables, 
c  -  number  of  the  input  connection  to  be  used  to  input  the  constant. 


Table  1.3:  Summary  of  C  Component  Modes  Description 


Mode 


Description 


vOOllOOlfO 

vOOllOlOflO 

vOOllOllfO 

vOOOOOOOfO 

vOOOOOOlfO 

vOOOOOlOfO 

vOOOOOllfO 

OlOOOOOOfO 


Immediate  constant  is  compared  with  NumOpsOut  inputs.  The  minimum 
is  output  and  the  index  of  the  first  minimum  is  sent  to  the  I  component.  If 
the  immediate  constant  is  equaJ  to  the  minimum,  the  index  is  all  O's  and  the 
flags  indicate  that  the  immediate  constant  is  equal  to  the  minimum. 

Same  as  the  first  entry  except  that  the  maximum  is  found  instead  of  the 
minimum. 

Same  as  the  first  entry  except  that  both  the  maximum  and  the  minimum  are 
found  and  both  extrema  and  their  indices  are  output.  The  minimum  and  its 
index  are  output  first. 

A  pair  of  inputs  are  compared  and  only  the  flags  are  output.  Only  one  input 
connection  is  used. 

The  minimum  of  NumOpsOut  inputs  is  found.  The  first  minimum  is  output 
and  its  index  is  sent  to  the  I  component.  Only  one  input  connection  is  used. 

Same  as  the  fifth  entry  except  that  the  maximum  is  found  instead  of  the 
minimum. 

Same  as  the  fifth  entry  except  that  both  the  minimum  and  maximum  are 
found  and  both  extrema  and  their  indices  are  output.  The  minimum  and  its 
index  are  output  first. 

The  inputs  on  the  two  input  connections  are  compared  and  only  the  flags 
are  output. 


vlcOlOOlfO 

vlcOlOlOfO 

vlcOlOllfO 

OlOOOlOOfO 


vOOOOlOOfO 


Same  as  the  first  entry  except  that  the  constant  arrives  on  the  input  con¬ 
nection  not  used  by  the  variable  input  stream,  v  and  c  are  complements. 

Same  as  the  second  entry  except  that  the  constant  arrives  on  the  input 
connection  not  used  by  the  variable  input  stream,  v  and  c  are  complements. 

Same  2ls  the  third  entry  except  that  the  constant  arrives  on  the  input  con¬ 
nection  not  used  by  the  variable  input  stream,  v  and  c  are  complements. 

NumOpsOut  inputs  arrive  on  each  of  the  two  input  connections  and  the 
absolute  values  of  the  differences  of  the  successive  pairs  of  inputs  are  deter¬ 
mined.  These  absolute  values  are  compared  with  an  immediate  operand  and 
the  flags  indicate  whether  are  not  all  of  them  are  less  than  or  equal  to  the 
immediate  operand.  Only  the  flags  are  output. 

Same  as  the  preceeding  entry  except  that  the  NumOpsOut  input  pairs  arrive 
on  a  single  connection. 


f  -  function  code  bit  assigned  by  designer. 

V  -  nunii-'-:  of  the  input  connection  to  be  used  to  input  the  variables, 
c  -  number  of  the  input  connection  to  be  used  to  input  the  constant. 


Instructions  to  other  MCAP  components 


Free  Req  Ack 


Output 

instruction 

queue 


Index 


Register 

set 


Internal 

logic 


)From 

comparator 


Flags 


Flags 


From 

processing 

components 


Input 

instruction 

queue 


RAM 


ROM 


Instructions  from  host 


To/from  host 


Fig.1.2.  Instruction  component. 
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(b)  Programmable  resisters. 


Function 


0  Normal  mode  - 
1  Primitive  mode 
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1  Binary 


0  Constant  not  immediate — ' 
1  Constant  is  immediate 

0  Variable  operands- 
1  Constant  operand 
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-0  Normal  execute  time 
1  Extra  execute  time 


(c)  Mode  register. 

Fig.  1.3.  E  component 


Reset  Free  Req  Ack 


Req- 

Ack- 

Data 


> 


Req- 

Ack- 

Data 


Q 

u 

E 

U 

E 


Q 

U 

E 

U 

E 


A 

I 

d 

n 

d 

s 

r 

t 

Queue 


Control  logic 


Processing 

logic 


Data 


(a)  Block  Diagram. 


I 

Mode 

NumOpsOutConst 

Dec  Amt 

NuraOpsOut 

NumRepetitions 

(b)  Programmable  registers. 


Function 


Variable  input: . 
0  Input  0  used 
1  Input  1  used 


0  One  input  used  - 

1  None  or  two  inputs  used 

Constant  input: 

0  Input  0 
1  Input  1 
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-0  Normal  execute  time 
1  Extra  execute  time 


(c)  Mode  register. 


Fig.  4,  T  component 
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(c)  Mode  register. 
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Fig.  1.6.  J  component 
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Fig.  1.7,  Pattern  registers. 
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(b)  Programmable  registers. 
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Fig.  1,8.  F  component 
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Note:  Currently  the  enure  mode  register  is  not  used 


(c)  Mode  register. 


Fig.  1.9.  L  component 


(b)  Programmable  registers. 
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(c)  Mode  register. 


Fig.  1.10.  R  component 
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Fig.  1.11.  S  -Component  Block  Diagram 


Mode 


(a)  Registers 


00  Input  only 
01  Output  only 

10  Input  before  output 

11  Output  before  input 


(b)  Mode  register  format 


Fig.  1.12.  Registers  in  an  S  partition. 


Partition  modes  : 


00  Input  only 
01  Output  only 

10  Input  before  output 

11  Output  before  input 


Fig.  1.13.  S  component  mode  instruction  format. 


0  one  stream  used 


Fig.  1.14.  S  component  mode  register. 


Fig  L14  Example  MCAP  for  performing  matrix  operations. 
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(a)  DMOD  instruction 


(b)  DXMD  instruction 


Partition  modes  : 

00  Input  only 
01  Output  only 

10  Input  before  output 

11  Output  before  input 


Fig.  1.16.  D  component  mode  instruction  formats. 


Stream  usage: 
0  Not  used 
1  Used 


Fig.  1.17.  D  component  mode  register. 
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CHAPXER  2 

PROGRAMS  AND  FILES  ASSOCIATED  WITH  SIMARC 

SIMULATION  PACKAGE 


Tie  programs  in  the  simulate  architecture  (SIMARC)  system  of  programs  communicate 
through  a  collection  of  files.  The  relationship  between  the  programs  and  the  files  they  create  and 
access  is  summarized  in  Fig.  2.1.  The  text  editor  is  not  a  member  of  the  simulate  architecture 
program  set,  but  may  be  any  text  editor  capable  of  producing  an  ASCII  text  file.  The  editor 
currently  used  is  the  DOS  program  EDIT. 

The  files  used  by  the  system  may  be  given  any  legitimate  DOS  filenames,  but  must  have 
the  extensions  shown  in  the  figure.  These  files  are  classified  as  follows: 


-Architecture  file  (.S.AR)  - 

contains  a  complete  description  of  an  architec¬ 
ture 

-Architecture  error  file  (.SAE)  - 

contains  connection  errors  made  while  generating 
an  architecture 

Information  file  (-SIF)  - 

contains  a  textual  description  of  the  architecture. 

Program  file  f.S.AS)  - 

contains  a  source  program  in  the  form  needed  by 
the  assembler. 

List  file  (.SLT)  - 

contains  a  list  of  the  errors  produced  during  an 
assembly. 

Load  file  (SLD)  - 

contains  a  program  in  the  form  needed  by  the 
simulator. 

Result  file  (SRT)  - 

contains  the  results  produced  by  a  simulation. 

Irregular  intervals  file  (.SIR)  - 

specifies  the  time  at  which  results  are  to  be  taken 
w^hen  the  result  intervals  are  irregular. 

Fractional  result  file  (.SET)  - 

contains  the  same  information  as  an  .SRT  file 
except  that  the  results  are  given  as  percentages. 

Summary  file  (.SUM)  - 

contains  automatically  updated  summarv  infor¬ 
mation  from  succesive  tests. 

programs  are; 

EDITOR  - 

for  creating  a  new  architecture  or  changing  an 
existing  architecture. 
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ARCHCHh  -  checks  an  architecture’s  connections  for  errors.  It 

is  automatically  executed  each  time  a  .SAR  file 
is  created  or  updated. 

-  for  assembling  a  source  program  into  a  load  form 

that  can  be  used  by  the  simulator  (SIMULATE). 

SIMI  LATE  -  for  simulating  a  given  program  on  a  given  archi¬ 

tecture  and  accumulating  results  of  the  simula¬ 
tion. 

DISPLAY  -  for  displaying  the  results  of  a  simulation  on  the 

monitor. 

SIMSL  MM  -  for  producing  summary  information  as  well  as 

storing  the  results  as  percentages 

Each  type  of  file  is  described  in  a  subsection  given  below.  Each  subsection  defines  the 
content  and  format  of  one  of  the  file  types. 


2.1  Architecture  File 

An  architecture  (.SAR)  file  completely  describes  the  architecture  of  an  MCAP  and  provides 
the  graphics  information  needed  to  display  the  architecture  and  mnemonics  required  by  the 
assembler.  An  MCAP  consists  of  a  standard  set  of  connections  and  components  in  which  each 
end  of  each  connection  is  attached  to  exactly  one  component.  An  MCAP  is  completely  defined 
by  specifying  the 

•  Connection  type  for  each  connection  and  the  component  that  is  attached  to  each  end  of 
each  connection. 

•  Component  type  and  the  attributes  of  each  component. 

.\n  architecture  file  is  an  ASCII  file  that  is  divided  into  two  parts.  The  two  parts  are 
separated  by  a  string  consisting  entirely  of  one  or  more  asterisks.  The  first  part  is  for  storing 
connection  information  that  can  be  easily  used  to  modify  an  existing  architecture.  The  second 
part  contains  the  information  needed  by  the  assembler  and  simulator. 

The  second  part  of  the  architecture  file  is  broken  into  fields,  one  field  for  each  component 
in  the  MC.\P.  The  first  part  of  a  field  contains  the  information  that  is  common  to  all  com.ponents 
and  is  summarized  in  Fig.  2.2.  The  remaining  information  in  a  field  is  determined  by  the 
component's  type.  Figures  2. .3  through  2.9  summarize  this  information  for  each  type.  In  all  of 
these  figures  the  data  type  for  each  entry  or  subfield  is  given  in  parentheses.  Except  for  the 
type,  mnemonic  and  component  number,  the  data  corresponding  to  each  major  item  in  these 
figures  is  on  a  single  line  in  the  architecture  file,  even  if  it  is  a  single  number. 
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2.2  Information  File 


fh”.  “  '“ily  description  of  an  architecture.  It  has 

the  same  «!»  name  as  the  architecture  file  that  contains  the  architecture  it  describes  and  consists 
of  a  listing  of  the  components  and  their  attributes. 


2.3  Architecture  Error  File 


archit^ture  errors  (.SAE)  file  contains  descriptions  of  the  connection  errors  made  while 
using  EDITOR  to  create  an  architecture  and  a  summary  of  all  of  the  architecture’s  connections 
It  has  the  same  file  name  as  the  architecture  file  containing  the  errors 


2.4  Program  Source  File 


A  progpm  source  (.SAS)  file  is  an  ASCII  file  in  which  each  line  is  blank,  a  remarks  line,  or  an 
assembler  language  directive  or  instruction  for  an  MCAP.  .4t  present,  there  are  three  directives 
hey  are  given  in  Fig.  2.10.  In  an  EQU  statement  the  symbol  Aamc  is  assigned  the  value 
Constant.  Remarks  is  optional  and  can  be  any  string  of  text.  If  Remarks  is  not  present  then 

If^th^n  ^  semicolon,  in  which  case  the  reminder 

of  the  line  can  be  any  text.  PROC  and  ENDP  mark  where  instructions  are  placed  that  make 

up  procedures.  An  EQL  statement  can  not  be  placed  inside  the  directive  pair  PROC...ENDP 
and  an  instruction  can  not  be  placed  outside  them. 


tpr  1  •  instructions,  internal  instructions  and  external  instructions  In¬ 

ternal  instructions  are  those  that  are  executed  entirely  within  the  instruction  component  and 

external  instructions  are  those  that  are  distributed  to  and  executed  by  other  non-memorv  com¬ 
ponents.  Instructions  have  the  format:  uiemory  com 

Label:  Mnemonic  Component  Operand,..., Operand  ; Remarks 

where  Label  the  colon,  the  semicolon,  and  Remarks  are  optional,  but  if  Label  is  present  then  the 
colon  must  be  present  and  if  Remarks  is  present  then  the  semicolon  must  be  present.  Component 
appears  in  external  instructions  only.  Mnemonic  and  Component  must  be  followed  by  afleast 
one  space  character.  Operands  must  be  separated  by  a  comma  or  at  least  one  space  character. 

LaM  is  a  string  for  identifying  the  instruction.  Mnemonic  indicates  the  instruction 
type  and  Componenf  IS  a  mnemonic  that  indicates  the  component  that  ultimately  decodes  and 

t?re"of1nstru\w 

The  internal  instructions  are  summarized  in  Fig.  2.11.  They  include  instructions  for 
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implementing  subprograms  and  loops,  manipulating  the  data  in  the  instruction  component's 
registers,  and  halting  the  computer.  Also,  a  no-operation  instruction  is  included. 


The  external  instructions  are  summarized  in  Fig.  2.12.  They  are  for  putting  values 

into  the  various  registers  in  the  programmable  components  (i.e..  the  E,  T,  C  J  F  L  R  S  and 

D  components).  These  registers  determine  the  activities  within  their  respective  components. 

uring  the  execution  of  an  algorithm,  the  values  in  these  registers  designate  the  number  of 
operands  out,  number  of  operands  in,  mode,  and  immediate  operands;  input  and  output  patterns 
for  routing:  and  memory  partitioning  and  access  patterns  for  memory  controllers.  The  value  put 
in  one  of  these  registers  may  be  immediate  (i.e,,  the  value  indicated  bv  the  operand)  or  be  in 
the  instruction  component's  register  whose  ID  is  indicated  by  the  operand.  A  register  ID  begins 
with  an  asterisk  and  ends  with  up  to  five  digits. 


2.5  List  File 


.  ist  (.bLTj  file  consists  of  a  simply  listing  of  the  errors  produced  during  the  assemblv  process. 
It  has  the  same  file  name  as  the  source  program  file  that  is  assembled.  The  format  for  the  list 
file  consists  of  the  error  messages  for  each  line  containing  an  error  followed  bv  the  line  itself 


2.6  Load  Files 


•  (-SLD)  iJe  IS  an  ASCII  file  that  is  in  the  program  format  needed  by  the  simulator.  As 

indicated  in  Fig.  2.13,  the  first  line  corresponding  to  an  instruction  begins  with  the  instruc 
tion  s  opcode,  the  type  of  component  that  is  to  receive  and  execute  the  instruction,  and  the 
component  s  number.  The  opcodes  for  the  instructions  are  given  in  Fig.  2.14. 

This  information  is  followed  by  a  sequence  of  numbers  that  are  the  operand^  If  an 
instruction  pertains  to  a  particular  partition,  then  the  first  of  these  numbers  identifies  the  parti- 

nPVi  ntpf ^  partition  are  SOSP,  SPNI,  SPBS,  SWIS.  DOSP 

DPM  DPBS,  and  DWIS  (see  Fig.  2.12  in  Sec.  4).  If  an  instruction  includes  an  indeterminate 
number  of  operands,  then  the  number  preceding  the  indeterminate  set  of  operands  indicates  the 
number  of  operands  in  the  set.  For  JSIP,  FSOP.  LSIP.  LSOP.  SIPP,  SOPP  DIPP  DOPP  DPPI 
and  DPPO  two  sets  of  numbers  are  used  to  specify  a  pattern.  In  each  set  there  is  a  sequence  of 
Items  of  indeterminate  length  preceded  by  the  number  of  items  in  a  sequence,  which  is  in  turn 
preceded  by  the  total  number  of  times  the  items  of  the  sequence  that  are  used  before  the  other 

ISBP  and  LSBP  include  only  one  sequence  but  in  order  to  make  them  consistent  with  the  other 
pattern  instructions,  two  O’s  are  appended  to  them.  Fig.  2.15  gives  a  complete  description  of 
the  operands  for  the  various  instructions. 

Note  that  for  the  SPNI  and  DPNI  instruction  the  last  four  operands  or  the  last  two 


4 


operands  may  be  omitted  in  the  source  code.  If  omitted,  these  operands  are  set  to  0  in  the  load 
code. 

In  Sec.  4  it  was  indicated  that  an  operand  may  be  immediate  or  the  ID  of  an  instruction 
component  register.  If  an  operand  is  a  register  ID,  then  the  corresponding  operand  in  the  load 
file  will  begin  with  an  asterisk  and  end  with  up  to  five  digits.  The  five  digits,  of  course,  identify 
the  register. 


2.7  Result  File 

The  result  (.SRT)  file  is  an  ASCII  file  that  contains  the  results  of  a  program  simulation.  As 
indicated  by  Fig.  2.16(a),  a  result  file  consists  of  lines  with  the  first  line  containing  the  filename  of 
the  tested  architecture.  The  second  line  indicates  the  number  of  components  in  the  architecture. 
The  third  line  indicates  whether  regular  or  irregular  intervals  were  used  to  collect  the  results, 
the  fourth  gives  the  total  area  and  the  fifth  provides  the  column  headings.  Column  headings  are 
included  to  make  the  file  more  readable.  The  remainder  of  the  file  consists  of  a  field  for  each 
interval  for  which  results  were  recordered. 

.A.S  shown  in  Fig.  2.16(b),  each  field  begins  with  the  system  time  at  which  the  data  was 
taken.  In  the  remainder  of  a  field  is  one  or  more  lines  for  each  component.  Fig.  2.16(c)  gives 
the  format  of  the  lines.  Each  line  corresponds  to  a  component  or,  for  I,  R,  S  and  D  components, 
a  set  of  logic  with  a  component.  Each  line  begins  with  the  component’s  mnemonic  followed  by 
the  area  of  the  component  or  set  of  logic  for  wich  the  data  was  collected.  The  area  is  followed 
by  the  current  accumulated  times  the  component  or  set  of  logic  has  spent  in  each  state  since  the 
beginning  of  the  program  and  the  maodmum  number  of  entries  in  the  component’s  instruction 
and  data  queues  since  the  beginning  of  the  program.  Entries  that  do  not  apply  to  a  component 
or  set  of  logic  are  filled  with  O’s. 

The  possible  states  that  a  component  can  be  in  are  defined  below.  A  component  is  in 
exactly  one  of  these  states  at  a  time.  Some  types  of  components  can  assume  any  one  of  these 
states  at  a  given  time,  while  others  can  assume  only  some  of  the  states.  The  component  types 
that  can  take  on  a  state  are  given  in  parentheses. 

BL  S\  (all  component  types)  -  the  component  is  actively  performing  one  of 

its  functions. 

(all  component  types)  -  the  component  is  waiting  for  its  output  to  be 

taken  or,  for  the  I  component,  a  flag  to  be  set. 

IDLE  (I,E.T,C..J,F.L,R,S.D)  -  the  component  is  waiting  for  input. 

FREl  (ail  component  types)  -  the  component  is  completely  inactive. 

DIST  (E,T.C,J,F,L,R,S,D)  -  the  component  is  distributing  instructions  to 

its  registers. 
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For  the  S  and  D  components,  only  the  input  stream  logic  and  output  stream  logic  can 
take  on  all  five  states,  the  address  generator,  buses  and  memory  modules  can  be  only  FREE 
or  BUSY.  Also,  depending  on  the  component  type,  the  state  time  data  may  be  followed  by  the 
ma-Kimum  numbers  of  entries  in  the  component's  instruction  and  data  queues.  The  component 
t.vpes,  I,  B.  E,  T,  C,  J.  F,  L,  R,  S  and  D  have  instruction  queues  and  the  component  types  E, 
T.  C.  J.  F.  L,  R,  S  and  D  have  input  data  queues.  Component  types  J  and  L  have  multiple 
input  data  queues  and  only  the  maximum  number  of  entries  taken  over  all  queues  is  recorded. 
Component  types  R  and  S  also  have  output  data  queues  and  D  components  have  two  input 
queues  and  two  output  queues.  If  a  component  does  not  have  an  instruction  or  data  queue,  then 
a  0  is  entered  as  the  corresponding  data. 

Results  can  be  saved  at  regular  or  irregular  intervals.  Regular  intervals  are  used  to 
specify  saving  results  every  multiple  of  a  value.  For  example,  if  a  regular  inter\-al  of  1000  is 
specified,  results  will  be  saved  at  1000,  2000.  3000,  etc.  up  to  the  last  multiple  of  1000.  In 
addition,  the  results  are  saved  at  the  end  of  the  simulation  if  they  weren't  previously  saved  as 
a  multiple  of  the  regular  interval.  If  intervals  other  than  a  multiple  of  some  specified'  value  are 
required,  irregular  intervals  must  be  used. 

Irregular  intervals  specify  the  exact  point(s)  at  which  results  are  to  be  sav'ed.  As  many 
as  100  of  these  values  may  be  entered  for  any  simulation.  In  addition,  the  results  are  saved  at 
the  end  of  the  simulation  unless  this  would  be  a  duplication  of  the  last  interval  specified. 


2.8  Irregular  Intervals  File 

\\  hen  the  user  chooses  to  record  results  at  irregular  intervals,  the  system  times  for  recording 
the  results  may  be  entered  through  the  keyboard  or  taken  from  an  irregular  intervals  (.SIR)  file, 
whose  format  is  shown  in  Fig.  2.17.  The  file  simply  consists  of  the  times  followed  by  a  0,  which 
serves  as  a  terminator.  If  the  user  chooses  to  enter  the  times  through  the  keyboard,  then  he  or 

she  may  also  choose  to  store  the  times  in  an  .SIR  file  so  that  they  can  be  used  later  without 
re-entering  them. 


2.9  Fractional  Result  File 


A  fractional  result  {.SFT)  file  contains  the  same  information  and  has  the  same  format  as  the 
result  (.SRT)  file,  except  that  the  times  spent  in  each  state  are  replaced  by  percentages  of  times 
spent  in  each  state.  For  e.xample.  if  the  system  time  were  1500  and  the  time  spent  in  the  BUSY 
state  were  900.  then  60  would  replace  900  in  the  result  data. 
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2.10  Summary  File 


A  summary  (.SUM)  file  is  for  storing  BUSY  and  sustainable  speed  data  from  a  sequence  of  tests. 
Each  time  the  file  is  accessed  the  user  is  asked  to  enter  a  comment  through  the  keyboard  and 
the  data  needed  to  compute  the  BUSY  and  speed  information  is  extracted  from  the  specified 
result  (.SRT)  file.  The  BUSY  and  speed  information  is  then  appended  to  the  .SUM  file  using 
the  format  given  in  Fig.  2.18. 
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*  package  uses  the  DOS  program  EDIT,  any  text  editor  capable  of  producing  an 

ASCn  file,  could  be  used  to  produce  a. SAS  file.  f  c 


Fig.  2.1  Relationship  between  the  SIMARC  package’s  programs  and  files. 
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*Type  (a  single  character  I,  B,  M,  E,  T.  C,  J,  F,  L,  R,  S  or  D) 
"Mnemonic  (string  of  four  letters  and  digits) 

"Component  number  (positive  integer) 

Number  of  rectangles  used  to  display  component  (positive  integer) 
""An  array  of  quadruplets,  each  specifying  a  rectangle  as  follows: 

Left  (non-negative  integer) 

Top  (non-negative  integer) 

Right  (non-negative  integer) 

Bottom  (non-negative  integer) 

A  pair  specifying  position  of  component’s  displayed  text 
X  position  (non-negative  integer) 

\  position  (non-negative  integer) 

Text  size  (non-negative  integer) 

Text  orientation  (positive  integer) 

Execution  time  (non-negative  integer) 

‘Type  must  be  the  first  character  on  a  line  and  Type.  Mnemonic  and 
component  ID  constitute  a  line. 

*Size  of  array  is  the  number  of  rectangles. 

Fig.  2,2  Attributes  common  to  a.11  components 
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Instruction  queue  size  (positive  integer) 
Memory  Time  (positive  integer) 

Area  (positive  integer) 

Memory  Area  (positive  integer) 

(a)  I  component 


Instruction  queue  size  (positive  integer) 
Area  (positive  integer) 

(b)  B  component 


Fig.  2.3  Additional  features  for  the  I  and  B  components. 
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Alternate  execution  time  (non-negative  integer) 

Distribution  time  (positive  integer) 

Number  of  stages  in  accumulation  pipeline  (non— negative  integer) 
Input  connection  (positive  integer) 

Output  connection  (positive  integer) 

Instruction  queue  size  (positive  integer) 

Data  queue  size  (positive  integer) 

Area  (positive  integer) 

"^"^Additional  input  connection  (positive  integer) 

*0  if  component  is  not  part  of  an  accumulation  pipeline 
**Included  in  T  or  C  component  only. 


Fig.  2.4  Additional  attributes  for  processor  components. 
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Distribution  time  (positive  integer) 

^Number  of  stages  in  accumulation  pipeline  (non-negative  integer) 
Number  of  input  connections  (positive  integer) 

Array  of  input  connection  numbers  (positiv’e  integers) 

Output  connection  number  (positive  integer) 

Instruction  queue  size  (positive  integer) 

Data  queue  size  (positive  integer) 

Area  (positive  integer) 

*0  if  component  is  not  part  of  an  accumulation  pipeline 
**Si2e  of  array  is  the  number  of  input  connections 


.  2.0  Additional  attributes  for  the  J  component. 


Distribution  time  (positive  integer) 

^Number  of  stages  in  accumulation  pipeline  (non-negative  integer) 
Input  connection  number  (positive  integer) 

Number  of  output  connections  (positive  integer) 

**Array  of  output  connection  numbers  (positive  integers) 

Instruction  queue  size  (positive  integer) 

Data  queue  size  (positive  integer) 

Area  (positive  integer) 

*0  if  component  is  not  part  of  an  accumulation  pipeline 
‘"Size  of  array  is  the  number  of  output  connections 


Fig.  2.6  Additional  attributes  for  an  F  component. 


Distribution  time  (positive  integer) 

Number  of  input  connections  (positive  integer) 

Array  of  input  connection  numbers  (positive  integers) 
Number  of  output  connections  (positive  integer) 

Array  of  output  connection  numbers  (positive  integers) 
Instruction  queue  size  (positive  integer) 

Data  queue  size  (positive  integer) 

Area  (positive  integer) 


*Si2e  of  array  is  the  number  of  input  connections 
'*'*^ize  of  array  is  the  number  of  output  connections 


Fig.  2.7  Additional  attributes  for  an  L  component. 
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Distribution  t  me  (positive  integer) 

Input  connection  number  (positive  integer) 
Output  connection  number  (positiv'e  integer) 
Capacity  of  memory  (positive  integer) 
Instruction  queue  size  (positive  integer) 

Data  queue  size  (positive  integer) 

Area  (positive  integer) 


Fig.  2.8  Additional  attributes  for  the  R  component. 
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I 


Input  execution  time  (non  negative  nteger) 

Distribution  time  (positive  integer) 

Memory  module  capacity  (positive  integer) 

Input  connection  number  (positive  integer) 

Output  connection  number  (positive  integer) 

Number  of  memory  modules  per  bank  (positive  integer) 
Number  of  memory  banks  (positive  integer) 

Address  time  (positive  integer) 

Memory  time  (positive  integer) 

Memory  bus  time  (positive  integer) 

Number  of  output  memory  buses  (positive  integer) 

Number  of  input  memory  buses  (positive  integer) 
Instruction  queue  size  (positive  integer) 

Data  queue  size  (positive  integer) 

Address  queue  size  (positive  integer) 

Stream  area  for  each  stream  (positive  integer) 

Mem.ory  area  for  each  memory  module  (positive  integer) 
Bus  area  for  each  bus  (positive  integer) 

Address  generator  area  for  each  generator  (positive  integer) 
Host  connection  number  in  (positive  integer) 

*Host  connection  number  out  (positive  integer) 

"^These  attributes  are  for  D  components  onlv. 


Fig.  2.9  Additional  attributes  for  the  S  and  D  components. 
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Name  EQU 


Constant 


(a)  EQU  directive. 


label:  mnemonic  operand . operand 

label:  mnemonic  operand,...,  operand 

label:  mnemonic  operand,...,  operand 


(b)  PROC . ENDP  directives. 


Fig.  2.10  Directive  summary. 
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Format 

Description 

CALL 

Label 

Subroutine  call  to  Label 

RTRN 

Subroutine  return 

LOOP 

Operand.  Label 

Repetition  count  -  Operand 

NOOP 

Branch  address  =  Label 

No  operation 

RALT 

Terminates  the  program 

WAIT 

Operand 

0  -  waits  for  all  components  to  be  free 

MOVE 

R.  Operand 

1  -  waits  for  compaiaior  to  be  free 

(R)  <-  Operand 

ADDR 

R.  Operand 

(R)  <- (R)  +  Operand 

SUBR 

R.  Operand 

(R)  <“  (R)  -  Operand 

MULR 

R.  Operand 

(R)  <-  (R)  *  Operand 

DIVR 

R,  Operand 

(R)  <-  Quotient  (R)/  Operand 

NEGR 

R 

(R+1)  <- Remainder  (R)/  Operand 

(R)  <-  -  (R) 

BRAN 

Label 

Branch  to  Label 

BREQ 

R,  Operand.  Label 

Branch  to  Label  if  (R)  =  Operand 

BRNE 

R,  Operand,  Label 

Rimc'dio  Label  if  (R)  o  Operand 

BRGT 

R,  Operand,  Label 

Branch  to  Label  if  (R)  >  Operand 

BRGE 

R.  Operand,  Label 

Branch  10  Labe!  if  (R)  >=  Operand 

BRLT 

R.  Operand,  Label 

Branch  to  Label  if  (R)  <  Operand 

BRLE 

R,  Operand,  Label 

Branch  to  Label  if  (Rj  <=  Operand 

STOP 

Operand 

Sets  a  breakpoint  at  time  specified  by  Operand 

RSET 

Resets  all  components 

Legend: 

Label 

-  instruction  label 

Operand 

-  a  constant  or  resister  ID 

-  an  instruction  component  register  number 


Fig.  2.11  Internal  instruction  summary. 
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Format 


Description 


(Type)  EMM 

(Type)NOO 

(Type)NOI 

(Type)MOD 

(Type)REP 

(Type)DEC 

JSIP 

FSOP 

FSBP 

LSIP 

LSOP 

LSBP 

SIPP 

SOPP 

SOSP 

SPNI 

SPBS 

SWIS 

DXMD 

DIPP 

DOPP 

DOSP 

DPNT 

DPBS 

DWIS 

DPPI 

DPPO 

DHNO 

DHNl 


Component  Oprd 
Component  Oprd 
Component  Oprd 
Component  Oprd  Component* 

Component  Oprd 
Component  Oprd 

Component  Wprd  In  ...  In  Wprd  In  ...  In 
Component  Wprd  Out ...  Out  Wprd  Out ...  Out 
Component  Out ...  Out 
Component  Wprd  In  ...  In 
Component  Wprd  Out ...  Out 
Component  Out ...  Out 

Component  Wprd  Part ...  Part  Wprd  Part ...  Part 
Component  Wprd  Part ...  Part  Wprd  Part ...  Part 
Component  Part  Wprd  Os  ...Os  Wprd  Os  ...Os 
Component  Part  Patinc  SRI  RII  NR2  RI2  ** 
Component  Part  PartBase  PartSize 
Component  Part  WinSize 
Component  Oprd 

Component  Wprd  Part ...  Part  Wprd  Part ...  Part 
Component  Wprd  Part ...  Part  Wprd  Part ...  Part 
Component  Part  Wprd  Os  ...  Os  Wprd  Os  ...  Os 
Component  Part  Patinc  NRl  RII  NR2  RI2  ** 
Component  Part  PartBase  PartSize 
Component  Part  WinSize 
Component  Wprd  Part ...  Part 
Component  Wprd  Part ...  Part 
Component  Oprd 
Component  Oprd 


Immediate  operand  Wprd 
Number  of  operands  out  =  Oprd 
Number  of  operands  in  =  Oprd 
Mode  =  Oprd 

Number  of  repetitions  -Oprd 

Decrement  amount  Wprd 

Set  input  pattern 

Set  output  pattern 

Set  broadcast  pattern 

Set  input  pattern 

Set  output  patem 

Set  broadcast  pattern 

Set  input  partition  pattern 

Set  output  partition  pattern 

Set  partition  offset  pattern 

Set  repetitions  and  increments 

Define  partition 

Define  window 

Extra  mode  =  Oprd 

Set  input  partition  pattern 

Set  output  partition  pattern 

Set  partition  offset  pattern 

Set  repetitions  and  increments 

Define  partition 

Define  window 

Set  partition  pattern  in 

Set  partition  pattern  out 

Number  of  host  operands  out  =  Oprd 

Number  of  host  operands  in  =  Oprd 


*  For  T  and  C  component.  Component  indicates  the  single  input  or,  for  two  inputs,  the  variable  input;  otherwise, 
this  operand  is  not  present. 


**  The  pair  NT^2,  RI2  or  both  it  and  the  pair  NRl, RII  may  be  omined,  in  which  case  each  omitted  operand  is  set  to  0, 

Note:  Wprd  may  not  be  present,  but  must  appear  twice  or  not  at  all.  If  not  present,  then  the  entire  list  is  continually 
cycled  through. 


Legend: 

Oprd 

Component 

In 

Out 

Part 

Patinc 

NRl  M2 

RIIJRI2 

PartBase 

PartSize 

Os 

WinSize 


-  integer  or,  if  preceeded  by  an  asterisk,  a  register  ID 

-  mnemonic  of  component  to  be  programmed 

-  component  mnemonic  of  data  source 

-  component  mnemonic  of  data  destination  or  &  if  broadcasting 

-  partition  number  or  register  ID 

-  pattern  increment  or  register  ID 

-  number  of  repetitions  or  register  ID 

-  repetition  increment  or  register  ID 

-  partition  base  address  or  register  ID 

-  partition  size  or  register  ID 

-  offset  in  a  partition’s  offset  pattern  or  register  ID 

-  window  size  or  register  ID 


Fig.  2.12  External  instruction  summary. 


19 


Opcode 


Component  type 


Component  ID 


Sequence  of  numbers* 


*  In  the  sequence  of  numbers  some  of  the  numbers  may  be  preceded  by  an  asterisk,  in  which  case  the  number  is  a  register  ID. 


Fig.  2.13  General  format  of  a  load  file  instruction. 
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Opcode 

Mnemonic 

Opcode 

Mnemonic 

1 

CALL 

50 

(Type)IMM 

2 

RTRN 

51 

(Type)NOO 

3 

LOOP 

52 

(Type)NOI 

4 

NOOP 

53 

(Type)MOD 

5 

HALT 

54 

JSEP 

6 

WAIT 

55 

FSOP 

7 

MOVE 

56 

FSBP 

8 

ADDR 

57 

LSIP 

9 

SUBR 

58 

LSOP 

10 

MULR 

59 

LSBP 

il 

DIVR 

60 

SIPP 

12 

NEGR 

61 

SOPP 

13 

BRAN 

62 

SOSP 

14 

BREQ 

63 

SPNI 

15 

BRNE 

64 

SPBS 

16 

BRGT 

65 

DIPP 

17 

BRGE 

66 

DOPP 

18 

BRLT 

67 

DOSP 

19 

BRLE 

68 

DPNT 

20 

STOP 

69 

DPBS 

21 

RSET 

70 

DWIS 

71 

DPPI 

72 

DPPO 

73 

(Type)REP 

74 

(Type)DEC 

75 

DHNO 

76 

DHNI 

77 

SWIS 

78 

DXMD 

(a)  Internal  instructions  (b)  External  instructions. 

Fig.  2.14  Opcode  assignments. 
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Note: 


Immediate  constant,  number  of  operands  out, 
number  or  operands  in,  mode  or  number  of  repetitions 

Number 

(a)  IMM,  NOO,  NOI,  MOD,  REP.  DEC,  DHNO  or  DHNl  instruction. 


Number  of  items  in  first  pattern 


Number  of  items  in  second  pattern 


First  pattern 


Second  pattern 


Nun^er  Number  Number  ...  Number  Number  Number  Number  ...  Number 


Number  of  items  taken  from  fu-st 
pattern  before  second  pattern  is  begun 


Number  of  items  taken  from 
second  pattern  before  first 
pattern  is  returned  to 


(b)  JSIP.  FSOP,  LSIP,  LSOP,  SIPP.  SOPP,  DIPP,  DOPP.  DPPI  or  DPPO 


instruction. 


Number  of  outputs 
in  pattern 

Number  Number 


Output  connections  in  broadcast  pattern 
given  in  order  of  pattern 


Number  ...  Number  0  0 


(c)  FSBP  or  LSBP  instruction. 


Panition  no.  ^  ^ —  Number  of  repetitions^ 

Number  Number  Number  Number  Number  Number 

nt  ^  ^  Repeti don  inaemen ^ 

(d)  SPNl  or  DPNI  instruction. 


Pattern  increment 


)Orae  A u/7i*er  fields  may  be  followed  by  an  asterisk,  in  which  case  umber  is  a  remster  ID  and  the 
conientb  of  the  register  are  to  be  used  when  the  instruction  is  executed. 


Fig.  15  Formats  of  the  Sequence  of  Numbers  field  given  in  Fig.  14 
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Partition  no.  — ^  ^ —  Size  of  partition 

Number,  Number,  Number 
Base  address  of  partition" 


(e)  SPBS  or  DPBS  instruction. 


Partition  no. 


A 


r~ 


Number,  Number 


Window  size 


(0  SWIS  and  DWIS  instruction. 


Fig,  2.15  (Continued) 
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Architecture  filename 

Number  of  components  in  system 

Type  of  interval  (Regular  Interval  or  Irregular  Interval) 

Total  area 

Column  headings 

Results  field 

Results  field 

(a)  Overall  format. 


System  time 
Line  of  data 


Line  of  data 


(b)  Results  field  format 


+Component  Area  BUSY  time 


WAIT  time  IDLE  time  FREE  time  DISTtime 


*IQ 


+  Entries  that  are  not  applicable  to  a  component  are  filled  with  O's. 

*  Maximum  number  of  entries  in  the  instruction  or  data  queue  up  to  the  present 
time.  For  multiple  inputs,  the  m3J>timum  is  taken  over  ail  data  queues. 

Note:  For  the  I  component  the  results  occupy  two  lines,  one  for  the  decode 
logic  and  one  for  the  memory.  For  R  components,  the  results  occupy 
^0  lines ,  one  for  the  input  stream  and  one  for  the  output  stream. 

For  S  and  D  components,  the  results  occupy  one  line  for  each  data 
stream,  each  address  generator,  each  bus  and  each  memory  module. 


(c)  Line  of  data  format. 


Fig.  2.16  Result  file  format. 


*DQ 
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System  time  at  end  of  interval 

System  time  at  end  of  interval 
0 

Note:  The  final  zero  serves  as  a  terminator. 

Fig.  2.17  Irregular  interval  file  format. 
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Commevi 

Percent  BUSY  for  E  a.nd  T  componentsi  Pcrccntcigc 

Average  sustainable  speed:  Value  MFLOPS 

Fig.  2.18  Format  of  a  summary  (.SUM)  file  entry. 
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DESIGN  CONSIDERATIONS  FOR  IMPLEMENTING  A 
MODULARLY  CONFIGURED  ATTACHED  PROCESSOR  IN  A 

MULTI  CHIP  MODULE 


J.  Sanjay  Singh,  Buck  W.  Gremel,  Vijay  P,  Singh,  and  Glenn  A.  Gibson 
Electrical  and  Computer  Engineering  Department 
The  University  of  Texas  at  El  Paso 
El  Paso,  Texas  79968-0523 


Abstract 

Implementation  of  a  novel  modularly  configured  at¬ 
tached  processor  (MCAP)  architecture  was  evaluated 
using  1  fim  CMOS  logic  on  an  MCM-D.  The  tran¬ 
sistor  count  was  approTimately  ten  million  transis¬ 
tors.  distributed  on  twenty-five  chip  dice.  Delay,  area, 
and  power  calculations  were  performed  using  the  SUS- 
PESS  model.  Rent’s  rule  was  found  to  be  not  ap¬ 
plicable.  Speed  was  calculated  to  be  in  the  100  MHz 
range.  The  module  foot  print  was  found  to  be  90  cm^. 
Power  dissipation  per  unit  area  was  low  enough  to  al¬ 
low  air  cooling. 

1  Introduction 

Attached  processors  [1],  [2],  [3]  are  commonly  used 
for  the  purpose  of  very  quickly  executing  most  of  the 
system's  computational  tasks.  In  such  an  organiza¬ 
tion.  ‘'the  host  is  a  program  manager  which  handles  all 
I/O,  code  compiling,  and  operating  system  functions, 
while  the  attached  processor  concentrates  on  arith¬ 
metic  computation  with  data  supplied  by  the  host'’ 
[!]■ 

In  addition  to  quick  execution,  it  is  also  desirable 
to  execute  as  broad  a  set  of  algorithms  as  possible 
in  order  to  create  a  more  generally  applicable  pro¬ 
cessor.  Thus,  the  underlying  goal  of  the  designer  is 
to  efficiently  utilize  the  hardware  for  as  broad  a  set 
of  algorithms  as  possible.  However,  for  most  current 
designs,  the  average  sustainable  execution  rates  have 
been  found  to  be  only  b%  to  20%  of  their  peak  rates. 
For  example,  the  sustainable  rates  for  the  Cray  X-MP 
with  four  processors  may  be  as  low  as  5%  for  some 
algorithms  [4],  Although  some  of  the  lost  efficiency  is 
necessitated  by  the  algorithms,  much  of  it  is  due  to 
memory  accessing  and  contention  for  shared  resources 


in  general,  including  internal  buses. 

In  this  article  we  describe  the  implementation  of  a 
novel  modularly  configured  architecture  wherein  uti¬ 
lization  of  each  processor  is  greatly  enhanced  through: 
(1)  closely  matching  their  architectures  to  the  set  of 
algorithms  they  are  to  execute,  (2)  overlapping  of 
processing  and  memory  accessing  by  using  memory 
prefetching,  (3)  minimizing  the  movement  of  data,  (4) 
using  a  high-speed  CMOS  with  one  micron  technology, 
and  (5)  having  the  whole  MCAP  on  a  single  MCM-D. 

2  Modularly  Configured  Attached 
Processor  (MCAP)  Architechture 

An  MCAP  is  an  attached  processor  that  is  con¬ 
structed  entirely  from  a  standard  set  of  connections 
and  components  [5],  [6].  This  standard  set  consists 
of  three  types  of  asynchronous  connections  and  twelve 
types  of  components.  These  component  types  are  such 
that  each  member  of  the  class  may  include  parallel 
processing,  memory  to  memory  pipelines,  and  be  con¬ 
structed  in  a  building  block  fashion.  They  encom¬ 
pass  routing  as  well  as  memory,  control  and  processing 
components.  By  overlapping  processing  with  memory 
accessing  and  matching  an  architecture  with  a  set  of 
algorithms,  the  average  sustainable  rate  for  a  specific 
set  of  algorithms  can  attain  at  least  60%  of  the  peak 
rate.  These  rules  allow  the  components  to  be  easily 
configured  in  different  ways,  thereby  allowing  the  con¬ 
struction  of  attached  processors  that  efficiently  per¬ 
form  different  sets  of  algorithms. 

An  example  architecture  is  given  in  Figure  I,  It’s 
processing  subsection  includes  a  comparator,  a  nega¬ 
tor  (elementary  component),  a  reciprocator  (elemen¬ 
tary  component),  a  set  of  four  pipelined  adders  capa¬ 
ble  of  accumulation  (via  feedback),  and  a  set  of  four 
pipelined  multipliers.  Each  adder  and  multiplier  is 


constructed  of  four  stages  (a  two-input  component  fol¬ 
lowed  by  three  elementary  components).  All  commu¬ 
nications  to  and  from  the  processing  components  are 
through  six  link  components,  three  on  each  side  of  the 
processor.  Join  and  fork  components  are  provided  to 
allow  hexible  use  of  the  link  components.  There  is  a 
dual  access  component  to  provide  intermediate  mem¬ 
ory  and  a  connection  to  main  memory.  The  single 
access  component  provides  internal  storage. 


MOST 

nCURE  l.  A>  EXAMPLE  MCAP  ARCHITECmmE 

3  Design  Considerations  and  Results 

In  our  evaluation,  CMOS  was  picked  as  the  bench¬ 
mark  logic  technology  because  of  its  commercial  ma- 
turitv.  In  the  future,  we  plan  to  evaluate  other  faster 
technologies  such  as  GaAs,  BiCMOS,  and  EC  L.  Since 
the  signal  delays  associated  with  the  conventional 
printed  circuit  board  (PCB)  implementation  are  ex¬ 
pected  to  be  prohibitively  excessive,  it  was  decided 
that  the  fabrication  of  an  MCAP  on  a  Multi  Chip 
Module  (MCM)  or  Wafer  Scale  Integration  (W’SI)  are 
the  only  realistic  alternatives  for  attaining  high  per¬ 
formance.  W'SI  integrates  an  undiced  wafer  of  defect 


tolerant  VLSI  chips  with  global  power,  clock,  and  sig¬ 
nal  distribution  networks.  MCM  technology  on  the 
other  hand  has  discrete  V'LSI  chips,  probably  of  dif¬ 
ferent  types,  mounted  on  a  substrate  that  supports 
global  power,  clock  and  signal  distribution  networks. 
Since  the  chips  are  procured  separately,  the  substrate 
can  be  optimized  and  tested  independantly  before  as¬ 
sembly.  Therefore  defect  tolerance  is  not  required. 
However,  the  “known  good  die'’  problem  is  yet  to  be 
solved.  Further,  multichip  modules  are  classified  ac¬ 
cording  to  the  substrate  technology:  MCM-Ceramic, 
MCM-Deposition,  and  MCM-Laminated.  In  this  work 
vve  used  MCM-D  for  design  evaluation.  MCM-D  man¬ 
ufacturing  processes  are  similar  to  those  used  in  the 
semiconductor  industry  and  can  be  used  to  achieve 
high  densities  and  fine  line  geometries.  In  the  phys¬ 
ical  design,  we  must  face  the  traditional  problems  in 
placement  and  routing  required  by  the  high  perfor¬ 
mance  systems.  As  the  clock  frequency  is  increased, 
we  need  to  account  for  transmission  line  effects  due 
to  long  interconnections.  Parasitics  on  the  intercon¬ 
nects,  inductances  on  the  power  lines,  and  the  I/O 
pin  limitation  are  the  three  vital  shortcomings  of  cur¬ 
rent  packaging  technologies,  which  could  be  tackled  by 
(a)  minimum  chip  to  chip  interconnections,  (b)  high 
interconnection  density,  and  (c)  parallel  architecture. 
Other  factors  w^hich  need  to  be  considered  are  ground 
and  power  plane  generation  and  physical  design  ver¬ 
ification.  The  thermal  considerations  are  a  direct  re¬ 
sult  of  the  substrate  type,  bonding  selection,  and  the 
placement  of  chip  dice. 

A  system  level  model,  referred  to  as  the  SUS- 
PENS  model  (Stanford  University  System  PErfor- 
maNce  Simulator)  [7]  is  used  to  predict  the  perfor¬ 
mance  of  the  MCAP.  This  model  emphasizes  the  inter¬ 
actions  among  devices,  circuits,  logic,  packaging,  and 
architecture.  The  same  model  could  be  used  in  future 
to  compare  logic  technologies  (e.g.  CMOS,  Bipolar, 
and  GaAs)  and  various  packaging  technologies  (e.g. 
MCM-D,  MCM-C,  PCB,  and  W^SI). 

3.1  Chip  Level  Design 

3.1.1  Transistor  Count 

To  illustrate  the  method  used  to  estimate  the  total 
number  of  transistors  in  the  example  MCAP  (Fig.  1), 
we  will  consider  one  of  the  floating  point  adders.  Each 
adder  has  four  pipelined  stages  and  uses  the  IEEE 
double  precision  standard.  Further  this  adder  could 
be  broken  down  into:  (a)  nine  64-bit  registers  with 
4032  transistors,  (b)  seventy-four  2-input  XOR  gates 
with  592  transistors,  (c)  one  hundred  and  twenty-six  2- 


to-1  MUXs  with  504  transistors,  (d)  two  ll-bit  adders 
with  528  transistors,  (e)  one  52-bit  adder  with  1248 
transistors,  (f)  a  64-bit  leading  zero  detector  with  5000 
transistors,  (g)  tw’o  52-bit  barrel  shifters  with  4000 
transistors,  and  (h)  rounding  and  other  control  logic 
taking  6500  transistors. 


ELEMENT 

DESCRIPTION 

#  TRANSISTORS 

Memorv  Element 

■  Has  4  k  each  erf  RAM  and  ROM 

1  84M 

Has  8  words  of  HFO 

10.0  K 

Bus 

Has  8  words  of  FIFO 

lO.OK 

EiemcntarN 

Has  8  wc?rds  of  FIFO 

10.0  K 

Two  input 

Has  8  words  of  FIFO 

12.0  K 

Join 

3  inputs  and  8  FIFO 

14  OK 

Fork 

3  inputs  and  8  FIFO 

7  .0  K 

Link 

4  inputs  and  5  outputs 

25  .0  K 

St2tic  Ram 

16  elements  of  Ik  each 

6.30  M 

Single  Access  Controller 

Controls  8  Memory  elements 

1!  OK 

Dual  Access  Controller  i 

Corarols  8  memory  elements 
and  6  DMA  channels 

61.0  K 

C  um  pare 

serwls  out  Hags  and  Indices 

25  OK 

Rscipaxaie 

using  Convergence  method 

50  0  K 

Negate 

inven  the  sign  bit 

I.OK 

n  Pt  Adder 

using  CL.A  S.  Barrel  shifters 

23.0  K 

n  Pt  Multiplier 

using  Modified  Booth  s  Alg. 

61.0  K 

MCAP 

Total  0!  Transistors 

9.85  Million 

MCM 

With  25  Chips  and  600  L'O 's 

9.85  .Million 

eters  [7]  are  in  Table  2. 


Parameter 

MCM-D 

Pw  (uM) 

50 

Nw 

2 

Wint  (uM) 

25 

Hint  (uM) 

2 

@  1  Ghz  (uM) 

2 

Rint  (ohm/cm) 

3.4 

Dielectric  constant 

3.4 

Vm  (cm/nsec) 

16 

Cint  (pF/cm) 

I.O 

ZO  (ohm) 

60 

Cpad  (pF) 

0.25 

Pp  (uM) 

100 

TABLE  2.  THIN  FILM  HYBRID  PAILAMETERS, 
Average  Delay 

In  general,  the  minimum  size  of  a  logic  gate  has  a 
WjL  ratio  of  2.  Therefore,  we  began  with  a  ratio  of 
2  and,  by  stages,  moved  to  higher  values  in  order  to 
drive  a  load  in  a  small  amount  of  time. 


TABLE  1.  TR.ANSISTOR  COUNT  FOR  THE  VARIOUS  MC  AP  COMPONENTS. 


Thus  the  total  number  of  transistors  for  the  above 
adder  is  approximately  23K.  Similarly,  the  transistor 
count  for  the  pipelined  64-bit  floating  point  multiplier 
is  approximately  61 K  (based  on  a  modified  Booth’s  al¬ 
gorithm).  Likewise,  the  transistor  count  for  the  other 
elements  in  the  MCAP  were  calculated  and  the  re¬ 
sults  are  presented  in  Table  1.  The  resulting  number 
of  transistors  for  the  whole  MCAP  is  approximately 
ten  million. 

3,1.2  Output  Driver  Design 

In  the  proposed  MCAP  architecture,  the  bottle  neck 
is  the  communication  through  link,  single-access,  and 
dual-access  components  because  of  their  high  fanout 
and  large  interconnection  lengths.  This  means  that 
the  output  buffers  for  these  elements  must  be  rela¬ 
tively  large.  We  present  the  delay,  area,  and  power 
dissipation  calculations  for  the  buffers  as  functions  of 
fanout  (F)  and  interconnection  length  (^). 

For  the  chip  level  model,  we  have  assumed  that  the 
input  capacitance  of  a  gate  (including  the  lead  and 
ESD  capacitances)  is  Qn  =  1  pF.  Additional  param- 


FIGLUE  2,  BUFFER  AND  INTERCONNECT  DELAY. 


By  dividing  the  driver  into  a  number  of  buffers 
with  increasing  WjL  ratio,  optimum  speeds  can  be 
achieved.  It  has  been  found  that  a  stage  ratio  of  e 


[8]  gives  best  results.  VVe  have  used  a  stage  ratio  of 
3  for  simplicity.  The  optimum  number  of  stages  (N) 
is  dependant  on  the  load  capacitance  (C/).  The  rela¬ 
tionship  is 

.V  =  0.91(lnC/+4.19) 

where  N  is  truncated  (rounded  down)  to  the  nearest 
integer. 

Using  the  optimum  number  of  stages,  the  average 
delay  is 

Ta,g  =  0.484OV  -  1)  +  5C//3(.V  -  1)  -h  0.076  ns 

Delay  calculations  [9]  are  shown  in  Figure  2. 

Buffer  Area 

A  simple  inverter  with  {W/ L)n  =  (W/L)p  =  2  will 
need  an  area  of  171  A^.  A  buffer  with  equal  rise  (tr) 
and  fall  (tj)  times  requires  {W/L)p  -  2{W/L)n  =  4 
and  the  area  is  going  to  be  203  A“.  The  total  area 
of  the  buffer  depends  on  the  number  of  stages  and, 
hence,  is  a  function  of  F  and  L  We  have 

Area  ^  220  X 

Area  calculations  [9]  are  plotted  in  Figure  3. 


FIGURE  3,  AREA  OF  THE  BLTFERING  STAGE, 

Power  Dissipation  In  The  Buffer 
In  CMOS,  most  of  the  power  is  dissipated  during 
switching  and,  hence,  dynamic  powder  is  approximately 
equal  to  the  total  powder.  The  dynamic  powder  is 

Pd  —  Ct  X  X  fa^g  zz  V^iCl  +  Cbufj)/Tavg 

where  C^ufj  =  0.0152(3^'^''"^)  pF. 


Power  dissipation  calculations  [9]  are  presented  in 
Figure  4.  Since  the  design  of  an  MCAP  uses  asyn- 


HGURE  4.  POWER  DISSIPATED  IN  THE  BUFFER. 

chronous  communication,  the  transfers  over  a  link 
component  involve  the  return  of  an  acknowledge  sig¬ 
nal  and  the  transmission  of  an  output  enable  signal. 
It  is  estimated  that  the  transfer  rate  may  be  as  high  as 
/  =  l/2[Tat.^  +  Tchtp]  Hz  (w'here  Tchip  is  the  delay  of 
the  chip  and  Tavg  is  the  delay  on  the  interconnection) 
Load  Capacitance 
For  the  load  capacitance 

Cl  =  Cinttoi  -r  F  X  Cin. 

wuth 

Qnt,,,  -  Cint  X  i  -  tpY 

w^here  i  is  in  cm  and  C,nf  =  1  pF/cm.  Therefore, 

G  =  (£+F)  pF. 

The  resistance  of  the  interconnect  is 
-  Pint  X  19.  . 

3.1.3  Modified  SUSPENS  Model 

Given  (a)  the  approximate  number  of  logic  gates,  (b) 
the  transistor  technology  parameters,  (c)  packaging 
technology  parameters,  and  (d)  the  number  of  pads 
per  chip  (estimated  from  Rent’s  rule),  the  SUSPENS 
model  can  estimate  system  performance.  Rent’s  rule 
is  an  empirical  result  obtained  by  observing  existing 


designs.  The  design  philosophy  and  methodology  af¬ 
fect  Rent’s  constants.  If  the  predictions  are  made  for 
a  system  with  an  entirely  different  design  philosophy 
from  the  one  from  which  Rent’s  data  were  obtained, 
the  results  will  have  little  meaning.  The  SUSPENS 
rrodel,  as  originally  proposed,  used  Rent’s  constants, 
therefore  we  have  developed  constants  that  are  ap¬ 
plicable  to  the  novel  architecture  of  the  MCAP.  Our 


nCLTlES.  BLOCK  DUCRAM  OF  THE  ADDER  CfflP 


Ctr  (for  1  CMOS)  =  3  fF 
Rtr  (for  I  /imCMOS)  =  15  K 
Ti  =  0.41  ns 

The  output  stage  delay  is 

T,  =  li^.[e,,Cint  +  KiCtr] 

A  0 

+L.R,n,[^-^^^  +  KiCtr] 

iav  (average  interconnection  length) 

Cint  =  2  pF/cm 
=  375  il/cm 

The  total  gate  delay,  Tg  =  Ti  +  To,  is  0.84  ns.  The 
delay  for  an  adder  is 

Tchip  =  fldTg  -h  Ri  71  tC i  nt{  /  2)  +  {Dc/l'c) 

restrict  the  logic  depth,  fid,  to  6 
Vc  =  2.5  X  10^“  cm/sec 
Dc  =  0.3  cm 

Tckip  6(0.84  X  10‘^)  +  375(2  x  10"^^)(0.3V2) 
+(0-3/2. 5  X  10^^) 

=  5-07  ns  (which  includes  the  latch  time, 

logic  time,  setup  time,  and  clock  skew) 

Thus  the  maiximum  frequency  of  the  adder  chip  is  197 
MHz.  By  incorporating  pipelining  and  recalling  that 
the  total  chip  area  actually  has  four  of  these  floating 
point  64-bit  adders,  the  throughput  is  improved  by  a 
factor  of  more  than  four. 


approach  to  developing  these  constants  was  to  deter¬ 
mine  the  area  needed  for  an  inverter  (with  equal  rise 
and  fall  times)  and  a  carry  generator  circuit.  From  this 
we  computed  the  average  area  per  transistor.  Thirty 
percent  of  that  area  was  assumed  to  be  taken  by  the 
interconnections,  resulting  in  a  figure  of  203  per 
transistor.  This  result  was  used  as  input  to  the  SUS¬ 
PENS  model  for  computations  done  at  the  chip  level. 
The  I/O  buffer  areas  were  estimated  separately  as  pre¬ 
sented  in  section  3.1.2.  To  illustrate  the  use  of  the 
SUSPENS  model  using  table  3  [7],  we  will  consider  the 
adder  chip  with  a  1  ^  technology  (see  Figure  5).  The 
adder  chip  contains  four  64-bit  floating  point  adders. 

The  input  stage  delay  is 

Ti  =  /,  ~-3A'oCt. 

Ai 

fg  (number  of  n  —  transistors  in  series)  =  3 
Ki  (W/L  ratio  of  input  transistors)  =  4 
Ro  (W/L  ratio  of  output  buffer)  =  4 


Parameter 

CMOS 

Uff 

I.O 

tgox  (A) 

250 

Vdd  (V) 

3.3 

Rtr  (ohm) 

15,000 

Ctr(fF) 

3.0 

Wint  (uM) 

2.0 

Wsp  (uM) 

2.0 

Hint  (uM) 

0.4 

pw'  (uM) 

4.0 

nw 

3 

Rint  (ohm/cm) 

375 

Cint  (pF/cm)  j 

2.0 

TABLE  3.  1  micron  TECHNOLOGY  PARAMETERS 
The  external  capacitance  of  a  gate  is 

U'ert  ~  f g^av g^int  H"  fgRi^tr  “  36.0  fF , 


W'heretavg  =  81.3x10  ^  cm.  The  internal  capacitance 
of  a  gate  is 


C.nf  ==  3A'oGr  +  oCtr  =  5  1  fF 

The  to  al  capacitance  per  logic,  gate.  Cg  —  Cerf+Gat, 
is  87.5  fF. 

3.2  Interconnection  and  Packaging  Con¬ 
siderations 


Once  the  results  for  each  technology  have  been  ob¬ 
tained  (see  Table  4),  the  package  level  model  is  in¬ 
corporated  using  MCMs  and  WSI  (Table  2). Layout  of 
MCAP  in  a  MCM  configuration  is  shown  in  Fig. 6.  The 
average  interconnection  length  at  the  module  level  (in 
units  of  chip  footprint  size)  is 


Rm 

Rm 


2  -  1  __  1  1  ^4^-1 

4^-0  5  _  I  1  —  4^”®  1  ~ 

.V,  .3  4 

7]  =  0.65  (5ee[7]) 

1.33 


Component 

#  of  Tran’s 

— 

Area 

(cm^) 

#  of  i/o’s 

fc 

(Mhz) 

Pc 

(watt) 

•Adder 

176  k 

0.41 

660 

131 

2.75 

Multiplier 

32Sk 

0.41 

660 

122 

3.44 

C-N-R 

118  k 

0.55 

874 

139 

2.54 

SR.VM 

(4k. 

1.57  M 

0.58 

150 

no 

8.66 

SRAM 

{I6*ik) 

393  k 
{*  16) 

0.16 

150 

(M6} 

no 

2.36 

ROM 

262  k 

0.11 

80 

110 

1.66 

D'Control 

82  k 

0.56 

846 

120 

3.67 

S-control 

32  k 

0.14 

222 

114 

0.70 

Instiv'Bus 

20  k 

0.14 

200 

135 

1.65 

Link 

15  k 

0.14 

200 

123 

1.27 

Module 

10  M 

42,0 

6CO 

100 

68.82 

TABLE  4.  PAR-\METERS  OF  THE  MCAP  ON  MCM-D. 


{ 1  micron  PROCESS) 


As  explained  in  section  3.1.2,  the  output  driver  is 
designed  for  a  critical  length  of  1.5  cm  and  fanout  of 


3.  From  Fig.  2,  the  buffer  delay  is  found  to  be  2.45ns. 
Further,  the  capacitive  delay  (caused  by  the  loading  of 
the  I/O  pins  and  contact  pads)  and  the  time  of  flight 
delay  on  the  transmission  lines  are  both  calculated  (for 
a  1  /i  technology)  using 

'Raddi  ional  ~  2  •  Zq  •  Cpad  “b  —  0.124  nS 

Adding  in  Tchip  —  5-07  ns,  we  determine  that  the  total 
delay  of  the  adder  chip  =  5.07  ns  +  2.45  ns  +  0.124 
ns  =  7.64  ns.  Thus,  the  adder  chip  can  output  at  the 
rate  of  131  MHz. The  dynamic  power  dissipation  per 
gate  is  based  on  the  maximum  adder  chip  operating 
frequency  {fc)  and  the  percentage  of  gates  that  switch 
during  a  clock  period  (/^).  The  dynamic  power  dissi¬ 
pation  is 

p,  =  \fchC,\5D 

=  1(131  X  10®)(0,3)(83.5  X  10"‘-)(3.3)- 

=  56.3  /iW 

The  power  dissipation  of  the  chip  is  the  product 
of  the  number  of  gates  (.V^)  and  the  dynamic  power 
dissipation  per  gate  (Pg)- 

Pc  -  Ng^Pg 

=  (176b/4).(56.3  X  10"^) 

=  1.65W 

Thus  the  power  density  for  the  adder  chip  (area  = 
0.09  cm^)  is  18.4  W/cm".  The  parameters  for  other 
chips  are  calculated  as  described  above,  and  Table  4 
gives  the  results.  A  similar  procedure  is  repeated  for 
0.5  fs  and  0.25  jj  technologies.  The  area  required  by 
the  output  buffers  are  added  to  the  transistor  area  to 
get  the  die  area.  Assuming  an  area  distributed  solder 
bumps  with  100  jjm  diameter  and  250  fjm  pitch.  The 
footprint  of  the  Adder  chip  die  is  found  to  be  0.64  cm. 

The  module  frequency  is,  therefore  appoximately 
100  MHz  (considering  the  processing  and  driving  in¬ 
volved  in  one  cycle).  The  module  size  is  9.0  cm  X  10.0 
cm.  Module  power  dissipation  {Pm)  is  determined  by 

CV.  =  7^A;.Vp(3if^G.  +  2Cpad  +  .R^Fj,C.>,:) 

I  +  Fc  I-  0  ^  ^ 

Crn  =  63,4  nF 

Pm  =  \iFD){fs)iCm){VlD)  =  20-7  VV' 

The  actual  power  dissipation  is  the  greater  of  Pm  ^nd 
the  sum  of  the  power  dissipated  at  all  the  chip  dice. 
Thus  we  calculate  the  power  dissipation  for  the  mod¬ 
ule  to  be  68.82  W. 


4  Conclusions 


5  Acknowledgements 


Design  evaluations  for  implementing  a  novel  mod- 
ularly  configured  attached  processor  architecture  us¬ 
ing  CMOS  logic  on  an  MCM-D  (Fig  6.)  revealed 
that  approximately  ten  million  transistors  will  be 
needed. These  could  be  placed  on  a  set  of  twenty  five 
chip  dice.  Delay,  area,  and  power  calculations  were 
done  with  the  SUSPENS  model  (however.  Rent’s  rule 
was  not  used).  Delay  calculations(including  logic 


nCLTtE  6.  LAYOUT  OF  MCAP  ON  xVICM 
(Ail  the  die  areas  have  been  optimized  for  15  W/sq.cm) 

#  of  Transistors  =  10  Million 

#  of  I/Os  =  600 

delay,  interconnect  delay  and  output  driver  delay) 
showed  that  the  MCAP  module,  on  average,  would 
achieve  speeds  in  the  150  MFLOPS  range.  The  single¬ 
access  memory  controller  component  chip  (S-control) 
was  found  to  be  the  slowest,  110  MHz  for  1  /im,  160 
MHz  for  0.5  /im,  and  220  MHz  for  0.25  /im  CMOS. 
The  areas  w^ere  optimized  for  power  densities,  low 
enough  to  allow’  air  cooling.  Higher  speeds  would  be 
achievable  with  faster  logic  like  BiCMOS,  ECL,  and 
GaAs.  Power  dissipation  calculation  showed  that  ap¬ 
proximately  70  watts  will  be  dissipated  in  the  MCAP 
module  and  air  cooling  would  suffice  for  the  1  p  de¬ 
sign  rule. We  are  in  the  process  of  performing  fur¬ 
ther  design  calculations  involving  Wafer  Scale  Inte- 
gration(WSI)and  GaAs  technology. 
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Abstract 

.4  broad  class  oj  attached  processors  (SdCAPs)  that  are 
constructed  from  a  standard  set  of  connections  and  com- 
pcnents  is  defined  and  a  simulation  package  (SIMARC) 
for  evaluating  members  of  the  class  relative  to  specified 
sets  of  algorithms  is  described.  Together  the  MCAP  con¬ 
nection  and  component  definitions  and  SIMARC  package 
provide  a  fast  prototyping  means  when  designing  efficient 
attached  processors  for  executing  computationally  intense 
algorithms.  A  dynamic  multicolored  graphical  display  dur¬ 
ing  a  simulation  facilitates  the  detection  of  bottlenecks  and 
a  graphical  editor  permits  easy  modification  of  an  architec¬ 
ture.  Use  of  SIMARC  requires  specification  of  the  architec¬ 
ture  and  writing  assembler-level  programs  for  executing  the 
algorithms  of  interest.  example  is  presented  in  which 
SIMARC  IS  used  to  design  an  MCAP  that  achieves  95%  ef¬ 
ficiency  (for  the  processing  logic)  while  performing  matrix 
multiplication.  ^ 

1  INTRODUCTION 

Modern  design  procedures  include  analytical  model¬ 
ing.  simulation  modeling,  simulation  evaluation  and 
prototyping  [1],  with  analytical  modeling  being  the 
mathematical  and  statistical  analysis  of  a  relatively 
crude  system  model.  This  phase  of  the  design  process 
is  the  pencil  and  paper  approach  that  can  be  done 
quickly,  but  is  the  least  accurate  in  predicting  the  per¬ 
formance  of  the  final  product.  Simulation  modeling 
requires  a  much  more  detailed  level  of  system  specifi¬ 
cation,  a  level  that  is  sufficient  to  provide  the  design 
input  to  the  computer  simulation  program,  or  simu¬ 
lator.  that  is  to  be  used.  Test  data  is  then  selected 
and  the  simulator  is  used  to  produce  a  more  accurate 
estimate  of  performance  than  can  be  obtained  from 
analytical  modeling.  The  accuracy  of  this  estimate  de¬ 
pends  on  the  level  of  detail  included  in  the  simulator, 
the  quantity  of  test  data  and  the  care  used  in  choos¬ 
ing  the  test  data.  Simulation  modeling  and  evaluation 

^  The  work  reported  in  this  paper  was  supported  in  part  by 
the  Office  of  Naval  Research  under  Grant  No.  N00014-93-1- 
1343,  .A.ny  opinions,  findings,  and  conclusions  or  recommenda¬ 
tions  expressed  in  this  paper  aire  those  of  the  authors  and  do 
not  necessarily  reflect  the  view  of  the  funding  agency. 


are  iterated  until  satisfactory  results  are  obtained,  at 
which  time  a  prototype  is  built.  If  the  prototype  needs 
only  minor  adjustments,  the  design  team  may  proceed 
with  the  production  design;  otherwise,  it  must  return 
to  an  earlier  phase  and  repeat  the  design  process.  Be¬ 
cause  prototyping  is  normally  expensive,  it  is  impor¬ 
tant  that  the  simulations  be  extensive  enough  to  avoid 
such  returns.  However,  it  is  also  important  that  the 
prototyping  stage  be  reached  as  quickly  as  possible, 
and  much  attention  has  recently  been  given  to  devel¬ 
oping  simulation  and  performance  evaluation  systems 
that  permit  fast  prototyping. 

For  computer  system  design,  simulation  may  be  per¬ 
formed  at  several  levels,  ranging  from  the  major  com¬ 
ponent  level,  to  the  module  level  [printed  circuit 
board,  multichip  module  (MCM)  or  wafer,  whichever 
is  applicable],  to  the  register  transfer  level,  to  the  gate 
level,  and  then  to  the  transistor  level.  As  one  proceeds 
from  the  major  component  level  to  transistor  level, 
clearly  the  amount  of  detail  increases  and  the  time 
needed  to  execute  a  simulation  increases  accordingly. 
Therefore,  there  is  a  tradeoff  between  simulation  time 
and  accuracy.  Normally,  top-down  design  would  be 
used,  which  employs  a  component  level  simulator  to 
design  the  overall  system  and  then  progresses  down¬ 
ward  to  the  transistor  level,  at  which  the  layouts  of  the 
individual  integrated  circuits  are  done  independently. 

To  reduce  the  complexity  of  a  simulator  and  quan¬ 
tity  of  specifications  required  by  it,  simulators  at  the 
major  component  and  module  levels  are  restricted  in 
various  ways.  They  are  normally  restricted  to  a  class 
of  architectures  and  sometimes  to  a  particular  type 
of  computer,  which  serves  to  specify  the  instruction 
set.  Eleven  such  performance  evaluation  tools  that 
relate  algorithms  and  architectures  and  consider  test¬ 
ing  of  programs  to  meet  real-time  constraints  are  sum¬ 
marized  in  [2].  Others  are  described  in  special  issues 
of  the  International  Journal  of  Computer  Simulation 

[3].  [4]. 

As  always,  the  accuracy  of  the  timing  information 
gathered  depends  on  the  level  of  detail  being  simu¬ 
lated.  Some  simulators  are  very  specific  in  their  ap¬ 
plication.  The  advantage  in  restricting  the  use  of  a 
simulator  is,  of  course,  that  highly  reliable  perfor¬ 
mance  data  can  be  gathered  in  a  reasonable  amount 
of  simulation  time.  Of  interest  here  is  an  architecture 


sin  ulation  package,  SIM  ARC,  that  simulates  a  fairly 
bread  class  of  attached  processors  referred  as  Mod- 
ularly  Configurable  Attached  Processors  (MCAPs). 
The  primary  advantage  -f  an  MCAP  is  that  it  is  con¬ 
structed  from  a  stane  ^et  of  connections  and  com¬ 
ponents  that  can  be  y  fit  together  to  match  the 

set  of  algorithms  it  is  .  xecute.  The  SI  MARC  pack¬ 
age  allows  the  quick  reconfiguration  and  simulation  of 
M("AP  architectures  so  that  an  MCAP  design  with 
high  utilization  of  its  logic  for  a  given  set  of  algo 
rit^tms  can  be  achieved  within  a  short  time.  Because 
MCAPs  are  connected  to  host  processors,  there  is  no 
need  to  simulate  operating  system  functions.  Also,  be¬ 
cause  MCAPs  use  memory-tomemory  pipelines,  they 
are  inefficient  when  performing  algorithms  that  in¬ 
clude  a  considerable  amount  of  decision-making  (e.g., 
a  binary  search).  Therefore,  emphasis  is  placed  on 
computationally  intense  algorithms  such  as  those  for 
performing  matrix  operations,  signal  processing  and 
image  processing  and  for  solving  simultaneous  linear 
equations,  partial  differential  equations  and  ordinary 
differential  equations. 

Section  2  gives  a  brief  MCAP  definition  and  the  next 
two  sections  describe  the  simulator  and  provide  an  ex¬ 
ample.  The  last  section  serves  as  a  summary  and  in¬ 
dicates  future  improvements  to  the  simulator. 

2  MCAP  DEFINITION 

An  attached,  or  back-end,  processor  is  a  processing  sys¬ 
tem  that  is  connected  to  a  host  computer  for  the  purpose  of 
very  quickly  executing  most  of  the  overall  system’s  compu¬ 
tational  tasks.  Typical  early  attached  processors  were  the 
AP-120B  and  F PS- 164  made  by  Floating  Point  Systems. 
Inc.,  the  IBM  3838,  and  the  MATP  made  by  Datawest. 
Inc.  [9].  Although  the  early  attached  processors  included 
limited  multiprocessing,  the  more  recently  implemented 
processing  arrays  are  also  controlled  by  a  host  (e.g..  the 
PAX  computer  [10])  and  are  designed  to  perform  most 
of  the  overall  system’s  computational  tasks.  The  specific 
purpose  of  an  attached  processor  is  to  execute  members 
of  a  set  of  algorithms  very  quickly.  The  broader  the  set  of 
algorithms  the  more  generally  applicable  the  attached  pro¬ 
cessor.  The  underlying  goal  of  the  designer  is  to  efficiently 
utilize  the  hardw'are  for  as  broad  a  set  of  algorithms  as  pos¬ 
sible.  By  using  the  MCAP  building  block  approach  along 
with  the  SI  MARC  package,  efficient  matches  between  ar¬ 
chitectures  and  sets  of  algorithms  are  easily  established. 

An  MCAP  is  an  attached  processor  that  is  constructed  en¬ 
tirely  from  a  standard  set  of  connections  and  components. 
This  standard  set  consists  of  two  types  of  asynchronous 
connections  and  twelve  types  of  components.  The  def¬ 
initions  of  the  connection  and  component  types  provide 
a  standard  set  of  rules  that  allow  the  components  to  be 
easily  configured  in  different  ways  to  construct  attached 
processors  that  can  efficiently  perform  different  sets  of  al¬ 
gorithms. 

An  MCAP  has  exactly  one  instruction  component  and  it  is 
connected  to  a  memory  component  for  storing  instructions. 


Most  of  this  memory  component  is  a  ROM  that  contains 
the  subprograms  needed  to  execute  the  algorithms,  but 
some  of  it  is  a  RAM  that  can  receive  instructions  (those 
that  initiate  the  subprograms)  from  the  host. 

An  MCAP  operates  by  drawing  an  instruction  stream  from 
the  instruction  memory  component  into  the  instruction 
component.  The  instruction  component  uses  internal  in¬ 
structions  in  the  stream  to  form  external  ins  ructions  that 
are  then  distributed  to  the  other  non-memor/  components 
through  the  MCAP’s  bus  component,  Tht  external  in¬ 
structions  are  for  setting  up  and  supervising  the  inter¬ 
connected  memory-to-memory  pipelines  within  the  MCAP 
that  perform  the  operation  needed  to  complete  an  algo¬ 
rithm  (or  phase  within  an  algorithm).  The  instruction 
stream  is  illustrated  in  Fig.  1.  Note  that  all  components 
in  the  instruction  stream  include  input  instruction  queues. 
When  the  non-memory  components  have  received  all  of  the 
instructions  needed  to  perform  an  algorithm,  they  auto¬ 
matically  prefetch  the  data  from  the  memory  components, 
route  the  data  to  and  from  the  processor  components  and 
store  the  results  back  into  the  memory  components.  All 
non-memory  components  have  input  data  queues.  Some 
controller  components,  which  are  the  components  that  su¬ 
pervise  all  memory  accessing,  are  used  to  automatically 
transfer  data  between  the  host’s  main  memory  and  the 
MCAP’s  memory  components.  The  instruction  and  data 
streams  are  separate,  thereby  allow'ing  the  instructions 
needed  for  the  next  algorithm  to  be  distributed  while  the 
current  algorithm  is  executing. 
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Fig.  1.  The  instruction  stream 


The  two  types  of  connections  are  referred  to  as  instruction 
and  data  connections.  All  connections  are  unidirectional 
and  asynchronous.  Memory  components  are  considered 
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to  be  integral  parts  of  controller/memory  subsystems  and 
the  design  of  the  connections  between  the  memory  compo¬ 
nents  and  their  controllers  is  left  to  the  designer  of  these 
subsystems. 

Instruction  connections  are  for  passing  instructions  from 
the  instruction  component  to  the  bus  component  and  from 
the  bus  component  to  one  of  the  other  non-memory  compo¬ 
nents.  An  instruction  connection  consists  of  unidirectional 
instruction  and  address  buses  and  a  Req/Ack  handshak¬ 
ing  pair.  The  component  that  is  to  receive  the  instruction 
is  indicated  by  the  a  component  number  on  the  address 
bus.  Data  connections  are  used  to  pass  data  between  com¬ 
ponents  and  consist  of  only  a  unidirectional  data  bus  and 
a  Rcq/Ack  pair.  All  transfers  include  the  latching  of  an 
instruction  or  datum  into  a  queue  at  the  receiving  end. 

The  twelve  types  of  components  are  divided  into  six  cate¬ 
gories  as  indicated  below: 

Instruction  (I) 

Bus  (B) 

Memory  (M) 

Processor 

Elementary-one  input,  one  output  (E) 
Two-input-two  inputs,  one  output  (T) 
Comparator-two  inputs,  one  output  plus 
special  outputs  (C) 

Router 

Join-multiple  inputs,  one  output  (J) 
Fork-one  input,  multiple  outputs  (F) 
Link-multiple  inputs,  multiple  outputs  (L) 
Controller 

RAM-internal  to  MCAP,  no  partitions  (R) 
Single-access-internal  to  MCAP,  has 
partitions  (S) 

Dual- access-connects  to  main  memory,  has 
partitions  (D) 

The  letter  used  to  indicate  each  type  of  component  is 
given  in  parentheses. 

As  mentioned  earlier,  an  MCAP  contains  one  mem¬ 
ory  component  for  storing  instructions,  one  instruc¬ 
tion  component  for  executing  internal  instructions  and 
forming  external  instructions,  and  one  bus  component 
for  distributing  the  instructions.  An  MCAP  may  con¬ 
tain  several  controller,  router,  and  processor  compo¬ 
nents  and  several  other  memory  components  for  stor¬ 
ing  data.  However,  the  other  memory  components  can 
be  connected  to  controller  components  only.  Only  con¬ 
troller  components  are  capable  of  being  programmed 
to  prefetch  data  from  and  deposit  data  into  data  mem¬ 
ory  components. 

Each  non-memory  component  that  is  used  during  the 
execution  of  an  algorithm  contains  an  instruction  in¬ 
put  queue,  one  or  more  data  input  queues,  and  con¬ 
trol  logic  that  includes  a  number  of  registers.  The 
instructions  for  an  algorithm  received  by  a  compo¬ 
nent  fill  these  registers  and  then  the  register  contents 
dictate  the  activity  within  the  component  while  the 
algorithm  is  executed.  They  determine  the  compo¬ 
nent’s  mode  and,  for  a  routing  component,  the  pat¬ 
terns  for  accepting  inputs  and  distributing  outputs. 


For  a  controller  component,  they  determine  the  mem¬ 
ory  part  tions  and  patterns  for  prefetching  operands 
and  storing  results. 

Each  of  the  components  that  receives  instructions  con¬ 
tains  a  Number  of  Operands  Output  (NOO)  register 
that  is  always  the  last  register  filled  before  the  compo¬ 
nent  begins  its  part  in  the  execution  of  the  algorithm. 
Each  tirnt  the  component  outputs  an  operand,  the 
NOO  reps  ter  is  decremented.  When  the  NOO  register 
becomes  zero,  the  component  has  completed  its  part 
in  execuung  the  current  algorithm  (or  phase).  It  may 
then  distribute  new  values,  those  needed  for  the  next 
algorithm,  from  its  instruction  input  queue  to  its  reg¬ 
isters.  This  cycle  may  continue  indefinitely.  Except 
for  reacting  to  the  handshaking  (i.e.,  Req  and  Ack) 
signals  in  its  connections,  each  component  acts  inde¬ 
pendently.  The  data  is  input  to  a  data  queue  through 
an  input  connection,  processed  or  routed  through  a 
bus,  and  output  through  an  output  connection. 

All  controller  components  have  an  output  data  con¬ 
nection  for  outputting  operands  to  the  remainder  of 
the  MCAP  and  an  input  data  connection  for  inputting 
results  from  the  MCAP.  Therefore,  they  must  be  ca¬ 
pable  of  handling  both  an  output  data  stream  and 
an  input  data  stream.  A  queue  is  inserted  in  each  of 
these  data  streams.  In  addition  to  the  NOO  register, 
a  Number  of  Operands  Input  register  is  needed  for  the 
input  stream.  A  D  controller  also  has  a  second  set  of 
input  and  output  connections. 


FT*.  2.  An  Example  MCAP  Arcbit«ture 


For  a  more  complete  descriotion  of  an  MCAP  refer 
to[ll].  An  example  architec  ure  is  given  in  Fig.  2.  Its 
processing  subsection  includes  a  comparator  (C  com¬ 
ponent),  a  negator  (E  component),  a  reciprocator  (E 
component),  a  set  of  ir  pipelined  adders  capable  of 
accumulation,  and  of  four  pipelined  multipliers. 
Each  adder  or  mul  is  constructed  of  four  stages 

(a  T  component  folic  .  ed  by  three  E  components).  All 
communications  to  and  from  the  processing  compo¬ 
nents  are  through  six  link  (L)  components,  three  on 
each  side  of  the  processor.  Join  (J)  and  fork  (F)  com¬ 
ponents  are  provided  to  al  ow  flexible  use  of  the  L 
components.  Also,  to  allow  for  accumulation  there  is 
a  feedback  connection  between  the  F  component  at  the 
output  from  each  adder  and  the  J  component  at  the 
input  to  the  adder.  There  is  a  dual-access  controller 
(D)  component  to  provide  intermediate  memory  and  a 
connection  to  the  single-access  controller  (S)  that  con¬ 
trols  main  memory.  A  second  S  component  provides 
additional  internal  storage. 

3  SIMARC  PACKAGE 

SIMARC  is  a  menu-driven  package  written  using  C++  and 
is  for  PC  compatible  computers  operating  under  DOS.  By 
using  C++  both  the  components  in  an  architecture  and  the 
instructions  in  a  program  could  be  defined  and  operated 
on  as  objects. 

The  programs  in  the  SIMARC  system  communicate 
through  a  collection  of  files.  The  relationships  between 
the  programs  and  the  files  are  summarized  in  Fig.  3.  The 
files  are: 

Archxieciurt  (.SAR)  -  a  complete  description 

of  an  architecture 

InformaUon  (SIF)  -  a  textual  description  of 

the  architecture. 

Program  (.SAS)  -  a  source  program  in  the 

form  needed  by  the 
assembler. 

List  (.SLT)  ~  a  list  of  the  errors 

produced  during  an 
assembly. 

Load  (.SLD)  -  a  program  in  the  form 

needed  by  the  simulator. 

Result  (.SRT)  -  the  results  produced  by 

a  simulation. 

Irregular  intervals  (.SIR)  -  specifies  the  time  at 

which  results  are  to  be 
taken. 

Fractional  result  (.SFT)  -  same  as  an  .SRT  file 

except  results  are 
percentages. 

Summary  (.SUM)  ~  summary  information 

from  succesive  tests. 

The  programs  are: 

EDITOR  -  creates  a  new  architecture  or 

changes  an  existing  architecture, 
assembles  source  program  into 


load  form  th^t  can  be  used  by 
SIMULATE. 

SIMULATE  -  simulates  a  program  on  an 

architecture  and  accumulates  results. 

DISPLAY  -  displays  the  results  of  a  simulation 
on  the  monitor. 

SIMSUMM  -  produces  summary  information  and 
stores  results  as  percentages 

When  using  the  SIMARC  pa:kage,  the  design  proce¬ 
dure  is  to: 

1.  Determine  the  set  of  algorithms  around  which  the 
MCAP  is  to  be  optimized,  the  technology'  and 
packaging  to  be  used  and  the  MCAP's  initial  ar¬ 
chitecture. 

2.  Use  EDITOR  to  graphical  enter  the  architecture's 
description. 

3.  Use  EDIT  to  write,  or  rewrite,  programs  for  the 
algorithms. 

4.  Assemble  the  programs  using  AS  and  the  archi¬ 
tecture  file  created  by  Step  2. 

5.  Simulate  the  programs  using  SIMULATE  and  de¬ 
termine  the  bottlenecks  by  examining  the  graph¬ 
ical  output  and  the  result  files. 

6.  Redesign  the  architecture  to  eliminate  the  bottle¬ 
necks  as  much  as  possible  within  the  constraints 
of  the  technology  and  packaging. 

7.  Repeat  Steps  2  through  6  until  the  design  is  op¬ 
timized. 


Rj,  3.  Relationship  b«rwe«n  tb<  SIM.^RC  packafe^s  programs  md  Hies. 
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From  other  simulation  packages  [12,13],  the  value 
of  graphical  output  duiing  simulation  and  graphical 
specification  of  a  system  so  that  it  can  be  easily  cre¬ 
ated  and  modified  is  well  established.  It  is  seen  from 
Fig.  3  that  EDITOR,  SIMULATE  ?Lnd  DISPLAY  dis¬ 
play  information  on  the  screen.  The  display  is  of  the 
architecture  specified  while  executing  EDITOR.  In  the 
display  each  component  is  shown  as  a  collection  of 
rectangles  which  may  contain  up  to  four  characters  of 
text.  Normally,  the  te^  t  is  an  identifying  mnemonic 
that  has  been  assigned  to  the  component  at  the  time  it 
is  created,  but  it  may  indicate  the  component's  type, 
percentage  of  time  in  a  particular  state,  or  the  com¬ 
ponent's  identifying  number.  A  component  may  also 
be  displayed  using  a  color  that  represents  its  type  or 
current  state.  In  addition,  EDITOR  can  show  an  ar¬ 
chitecture's  connections. 

Upon  starting  the  execution  of  SIMARC,  the  main 
menu  appears.  To  create  or  edit  either  an  architec¬ 
ture  or  program  the  user  would  respond  by  typing 
’’E”,  which  would  cause  the  edit  menu  to  appear.  This 
menu  would  allow  the  user  to  create  or  edit  an  archi¬ 
tecture  or  program,  assemble  a  program,  or  view  the 
errors  resulting  from  an  assembly. 

3.1  EDITOR  Program 

The  program  EDITOR  is  for  creating  or  modifying 
an  architecture  file  and  is  initiated  by  typing  "E”  in 
response  to  the  edit  menu.  It  first  requests  the  name 
of  the  architecture  file.  If  the  file  does  not  exist,  then 
the  architecture  is  to  be  created;  otherwise,  an  existing 
architecture  is  to  be  modified. 

An  architecture  is  defined  by  therfl)  attributes  of  its 
components,  and  (2).  connections  oetween  its  compo¬ 
nents.  There  are  two  kinds  of  attributes.  There  are 
graphical  attributes  for  displaying  the  component  and 
physical  attributes  that  give  the  component’s  char¬ 
acteristics  (its  execution  time,  queue  lengths  and  so 
on).  When  an  architecture  is  created,  a  menu  of  icons 
that  appears  on  the  right  side  of  the  screen  is  used 
to  enter  the  component  attributes  and  connections. 
Because  ail  MCAPs  require  exactly  one  each  of  the 
instruction,  bus  and  instruction  memory  components, 
and  the  connections  between  these  components  and 
the  other  components  must  always  be  the  same,  the 
user  must  enter  only  the  attributes  of  these  compo¬ 
nents.  Their  connections  are  made  automatically.  For 
all  other  components,  the  user  would  typically  enter 
the  physical  attributes  by  responding  to  a  series  of 
prompts,  shape  the  component  by  defining  its  rectan¬ 
gles,  move  the  component  to  the  position  it  is  to  be 
displayed  and  make  the  desired  data  connections  be- 
tw’een  it  and  the  other  components.  A  component's 
graphical  attributes  are  determined  during  the  shap¬ 
ing  and  moving  actions,  EDITOR  includes  an  easy 
means  of  duplicating  a  component  or  flipping  a  com¬ 
ponent's  image  w’ith  respect  to  either  the  horizontal 
or  vertical  axis.  Modifying  an  architecture  may  con¬ 
sist  of  simply  changing  the  attributes  of  some  of  the 
components,  deleting  or  adding  components,  or  delet¬ 
ing  or  adding  connections.  If  a  component  is  deleted, 
all  connections  to  the  component  are  automatically 
deleted. 


3.2  EDIT  and  AS  Programs 

The  DOS  program  EDIT  is  used  for  creating  and  edit¬ 
ing  source  program  files.  It  is  initiated  by  typing  "E" 
in  response  to  the  edit  menu.  The  language  used 
to  create  a  source  program  is  at  the  assembler  level 
with  labels  used  to  specify  instruction  addresses  and 
mnemonics  used  to  specify  instruction  types.  In  addi¬ 
tion,  the  mnemonics  assigned  to  the  components  when 
the  architecture  is  created  are  used  to  specify  the  com¬ 
ponents  that  are  to  receive  the  instructions,  sources  of 
the  input  connections,  and  destinations  of  the  output 
connections. 

The  program  AS  is  initiated  by  choosing  "A”  from 
the  edit  menu.  The  user  then  enters  the  names  of  the 
architecture  and  source  program  files.  As  the  assembly 
takes  place  any  errors  that  occur  are  recorded  in  an 
error  file  and,  at  the  end  of  the  assembly,  the  total 
number  of  errors  is  displayed.  By  pressing  <ENTER> 
a  return  is  made  to  the  edit  menu.  By  choosing  "V” 
from  the  edit  menu,  the  program  EDIT  may  be  used 
to  view  the  contents  of  an  error  file, 

3.3  SIMULATE  Program 

A  flowchart  of  the  program  SIMULATE  is  given  in 
Fig.  4,  When  "S”  is  typed  in  response  to  the  main 
menu,  the  simulation  begins  by  querying  the  user  for 
the  names  of  the  architecture,  program  and  result  files. 
Also,  it  asks  the  user  whether  the  results  are  to  be 
recorded  at  regular  or  irregular  intervals  and,  if  reg¬ 
ular  intervals  are  chosen,  the  length  of  the  interval  is 
requested.  If  irregular  intervals  are  chosen,  the  user 
must  either  enter  the  times  at  which  results  are  to 
be  recorded  or  the  name  of  an  irregular  interval  file. 
SIMULATE  then  enters  its  main  loop. 

SIMULATE  assumes  that  the  possible  states  that  a 
component  can  be  in  are  as  defined  below.  The  com¬ 
ponent  types  that  can  take  on  a  state  are  given  in 
parentheses. 


BUSY  (all  component  types)  - 
WAIT  (all  component  types)  - 


IDLE(I,E,T,CJ,F,L,R.S,D)- 
FREE  (all  component  types)  - 
DIST  (E,T,C.J.F,L.R.S.D)  - 


component  is  actively 
performing  a  function, 
component  is  waiting 
for  its  output  to  be 
taken  or,  for  the  I 
component,  a  flag  to 
be  set- 

component  is  waiting 
for  input- 
component  is 
completely  inactive, 
component  is  distributing 
instructions  to  registers. 


Each  time  the  main  loop  is  executed  the  states  of  all 
components  are  updated  according  to  their  current 
states,  including  the  states  of  their  queues,  and  the 
Ack  inputs  from  their  output  connections.  The  state 
diagram  for  components  that  can  take  on  five  states 
is  given  in  Fig.  5.  Also,  all  queues  are  updated  ac¬ 
cording  to  their  current  states,  Req  inputs  from  their 
input  connections  and  whether  or  not  they  have  been 
popped  by  the  components  containing  them.  No  more 


than  one  stale  change  can  occur  in  a  single  execution 
of  the  main  loop.  Each  time  around  the  loop  to  con¬ 
sume  one  basic  increment  of  time  and  system  time  is 
measured  in  terms  of  the  number  of  basic  increments 
(i.e.,  loop  executions)  that  have  occurred  since  the  be¬ 
ginning  of  the  simulation.  The  simulation  ceases  (i.e., 
the  loop  is  exited)  after  a  HALT  instruction  has  been 
encountered  and  all  components  have  returned  to  their 
FTEE  states.  The  results  that  are  recorded  consist  of 
tfe  times  each  component  spends  in  each  state  and 
the  maximum  number  of  entries  in  each  queue.  These 
times  and  numbers  of  entries  are  also  updated  each 
time  around  the  main  loop.  Whenever  the  system  time 
becomes  one  of  those  specified  as  a  time  to  record  the 
results,  the  results  are  output  to  the  results  file.  The 
results  can  be  used  to  determine  which  components 
need  to  be  faster  or  replicated  and  which  queues  need 
to  be  longer. 


Fig.  5.  Structure  of  state  diagrams. 

Before  entering  the  main  loop,  the  architecture  is  dis¬ 
played  on  the  screen  just  as  it  is  displayed  by  ED¬ 
ITOR,  except  that  the  rectangles  are  filled  in  with 
light  blue.  Each  state  has  been  assigned  a  color  as 
follows;  FREE  is  light  blue,  DIST  is  green,  IDLE  is 
dark  blue,  BISY  is  red  and  WAIT  is  magenta.  Be¬ 
cause  all  components  begin  in  their  FREE  states,  the 
initial  color  of  all  components  is  light  blue.  Each  time 
a  component  changes  its  state,  its  display  may  be  up¬ 
dated;  however,  the  updating  of  the  display  depends 
on  the  current  mode  of  SIMULATE. 

SIMULATE  may  be  dynamically  put  into  different 
modes  by  pressing  different  keys  as  follows:  '"s”  for 
single  step,  '’a'’  for  automatic  stepping,  ’’d”  for  free- 
running  with  display  updating,  and  "n"’  for  free- 
running  with  no  display  updating.  In  single  step  mode 
the  simulation  pauses  and  waits  for  a  key  to  be  pressed 
each  time  around  the  loop.  Automatic  stepping  causes 
a  brief  pause  each  time  around  the  loop,  but  a  key 
need  not  be  pressed.  A  free-running  mode  includes  no 
pauses.  Free-running  with  display  is  used  to  identify 
the  utilization  and  bottlenecks  by  viewing  the  colors 
as  the  simulation  progresses.  The  more  red  the  bet¬ 
ter,  because  red  indicates  a  components  logic  is  being 
utilized.  Dark  blue  shows  that  a  component  is  waiting 
for  input  and  magenta  shows  it  is  waiting  for  output. 
SIMULATE  runs  at  its  maximum  speed  when  free- 
running  with  no  display  is  used.  Pressing  '’q*'  causes 
SIMULATE  to  terminate. 

3.4  Other  Main  Menu  Selections 

The  program  DISPLAY,  which  is  initiated  by  choosing 
’"D"  from  the  mam  menu,  is  used  to  superimpose  per¬ 
centage  results  on  an  architecture  display.  The  user 
must  indicate  the  result  file  to  be  used  and  the  state 
of  interest.  The  percent  of  the  time  spent  in  the  spec¬ 
ified  state  is  displayed  on  the  architecture.  A  display 
of  the  results  at  the  end  of  a  simulation  or  displays 
at  the  intermediate  times  for  which  the  results  w'ere 
recorded  may  be  requested. 

By  selecting  '"G"  from  the  main  menu  the  user  may 
create  a  fractional  result  file  in  which  the  times  spent 
in  the  states  are  replaced  by  the  percentages  of  times 
spent  in  the  states.  This  selection  also  allows  certain 
summary  information  to  be  appended  to  a  summary 
file.  The  selection  "’A”  simply  causes  an  architecture 


lo  be  displayed.  The  user  enters  the  name  of  the  archi* 
lecture  file  in  response  to  a  query.  Typing  "X"  causes 
SI  MARC  to  terminate. 

4  MATCHING  AN  ALGORITHM 
TO  AN  MCAP  ARCHITECTURE 

Matching  an  architecture  to  a  set  of  algorithms  involves 
a  study  relating  the  flows,  storage,  and  processing  of  the 
data  required  by  the  algorithms.  Clearly,  there  is  no  point 
in  increasing  the  speed  of  a  processing  subsystem  if  the 
current  interconnections  and  memory  hierarchy  are  inade> 
quate  to  support  the  processing  or  vice  versa.  But  a  good 
balance  for  one  algorithm  may  not  be  a  good  balance  for  a 
different  algorithm.  What  is  needed  is  a  satisfactory  trade¬ 
off  for  the  work  mix  expected  of  a  system  and  a  means  of 
evaluating  the  design  chosen. 

Space  allows  only  a  single  example,  so  let  us  consider  the 
computation  that  most  frequently  occurs  in  computation¬ 
ally  intense  algorithms,  matrix  multiplication.  Let  us  ex¬ 
amine  how  the  MCAP  in  Fig.  2  could  be  analyzed  for  to  the 
algorithm  AB  =  C  using  the  middle  product  method  [9] 
where  A.  B  and  C  are  n  x  n  matrices.  Fig.  6  shows  the 
data  flow  through  the  MCAP. 


Pig.  6.  Bata  flow  for  matrix  multiplkatiofl  using  the  middle  product  method 

The  algorithm  consists  of  the  computations 

=  C,  »  =  1 . n 

j  =  i 

where  the  a,jS  are  the  elements  of  A,  the  BjS  are  the  rows 
of  B,  and  the  Cs  are  the  rows  of  C.  The  algorithm  pro¬ 
ceeds  by  storing  the  first  row  of  A  in  the  D  component’s 


memory.  Then  the  products  a\jBj,  for  j  =  are 

summed  and  stored  in  the  S  component  at  the  top  of  Fig.  2. 
Next,  the  second  row  of  A  is  brought  into  the  D  compo¬ 
nent  and  the  products  are  summed  and  the  results 

are  sent  to  the  S  component.  This  continues  for  all  rows 
of  A. 

By  matching  this  algorithm  with  the  architecture  in  Fig.  2, 
it  is  seen  that  each  adder  and  multiplier  must  perform  ap- 
proxim  itcly  fi^/4  operations  and  each  link  on  the  left  and 
two  of  .he  links  on  the  right  must  perform  approximately 
transfers.  (The  third  link  on  the  right  is  not  needed.) 
The  approximate  number  of  accesses  to  the  MCAP’s  S 
component  is  about  2n^  and  the  number  of  accesses  to  the 
D  component  and  host’s  S  component  is  about  .  If  T 
is  the  per  stage  processing  time  of  the  multipliers,  then  T 
should  also  be  the  per  stage  processing  time  of  the  adders 
and  r/4  should  be  the  transfer  time  of  the  links.  The  ac¬ 
cess  time  of  the  MCAP’s  S  component  should  be  T/S  for 
both  reads  and  writes.  For  T  =  40  ns,  the  link  transfer 
time  should  be  10ns,  the  average  memory  access  time  for 
the  MCAP’s  S  component  should  be  5  ns,  and  the  aver¬ 
age  memory  access  time  for  the  D  component  and  host’s  S 
component  should  be  10  ns.  The  computation  rate  would 
be  200  Mflops  per  second-  If  the  MCAP  were  pat  into  an 
MCM  or  wafer,  memory  interleaving  were  used,  and  mul¬ 
tiple  connections  are  made  to  the  host’s  memory,  these 
times  would  be  within  the  capability  of  current  HCMOS 
technology. 

To  verify  the  above  simple  analysis,  we  have  designed  a 
simulation  program  for  the  matrix  multiplication  to  be 
run  by  the  MCAP  simulator.  The  instruction  set  for  an 
MC-AP  architecture  consists  of  two  sets  of  instructions, 
internal  and  external.  The  former  is  processed  within 
the  instruction  component  and  the  latter  is  distributed 
to  the  other  components.  In  this  paper  we  will  only  dis¬ 
cuss  the  external  instruction  set.  The  external  instruction 
set  consists  of  three  types  of  instructions:  (1)  instructions 
which  set  the  number  of  operands  to  be  output  from  or 
input  to  a  component.  (2)  instructions  which  set  the  mode 
of  a  component,  and  (3)  instructions  which  set  input  or 
output  connection  patterns  for  the  router  components  or 
partition  and  operand  patterns  for  the  controller  compo¬ 
nents.  When  programming  an  MCAP  architecture,  each 
component  must  be  programmed  individually.  The  oper¬ 
ation  mode,  input/oulpul  connection  patterns,  number  of 
operands  to  be  input  or  output,  broadcasting  connection 
patterns,  etc.,  are  programmed  for  each  component  using 
external  instructions. 

The  operation  of  the  leftmost  link  component,  LINK,  con¬ 
sists  of  broadcasting  one  element  of  matrix  A  and  dis¬ 
tributing  a  row  of  matnx  B  among  four  of  the  join  com¬ 
ponents  that  are  connected  to  the  multipliers.  Thus, 
the  number  of  operands  lo  be  output  by  LINK  is  N  = 
Ti^{n  +  1).  The  following  instructions  program  the  LINK 
component  for  matrix  multiplication: 

ImodLINK,  2  :set  mode 

Isip  LINK,  F077  ;s€t  input 


pattern 

Isbp  LINK,  J059,  J057.  J060,  J062  ;set  broadcast 

pattern 

Isop  LINK,  #n,  4:,  J059.  J057,  J060,  J062  :set  output 

pattern 

Inoo  LINK,  N  ;set  number  of 

operands  out 
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The  first  instruction  sets  the  mode  so  that  the  first 
output  is  to  the  first  connection  indicated  by  the  out* 
put  pattern  and  the  next  n  outputs  are  to  the  con¬ 
nections  indicated  by  cycling  through  the  remainder 
of  the  output  pattern.  The  second  instruction  sets  the 
input  pattern  and  causes  all  input  to  come  from  the 
fork,  F077.  The  third  instruction  sets  the  broadcast¬ 
ing  pattern  by  specifiing  the  components  to  receive  the 
broadcast  operand.  The  fourth  instruction  sets  the 
output  pattern,  which  causes  the  LINK  component  to 
perform  a  broadcast,  indicated  by  the  symbol  k,  and 
then  to  distribute  n  operands  to  the  components  listed 
in  the  instruction.  This  pattern  is  repeated  until  all 
operands  have  been  output. 

The  processing  components,  T  and  E.  can  be  pro¬ 
grammed  as  follows.  Since  each  multiplier  and  adder 
is  composed  of  four  pipeline  stages,  one  T  component 
followed  by  three  E  components  are  needed  to  config¬ 
ure  one  multiplier  or  adder.  The  following  instructions 
assume  M  =  n^/4.  K  =  and  L  =:  nj A  and  set  up 
the  T  component  for  one  of  the  multipliers: 

tmod  T013.  48  ;set  mode 

trep  TO  13.  K  ;set  number  of  repetitions 

tnoo  T013.  L  ;set  number  of  output  operands 

The  mode  of  the  T  component  is  set  so  that  the  first 
operand  is  latched  and  multiplied  by  the  next  L  inputs. 
The  second  instruction  causes  the  L  multiplications  to 
be  repeated  K  times. 

5  SUMMARY  AND  CONCLUSION 

Preliminary  work  indicates  that  MC.AP  component  defini¬ 
tions  when  used  in  conjunction  with  the  SIMARC  pack¬ 
age  do  permit  fast  prototyping  of  attached  processors  de¬ 
signed  for  high  logic  utilization  while  executing  computa¬ 
tionally  intense  algorithms.  It  is  believed  that  SIMARC 
can  produce  designs  for  which  the  sustainable  computa¬ 
tion  rate  can  be  made  to  average  more  than  60%  of  the 
peak  rate,  even  for  fairly  diverse  sets  of  algorithms.  In  the 
matrix  multiplication  example  presented,  the  average  rate 
was  95%  of  the  peak  rate.  Someone  experienced  with  EDI¬ 
TOR  could  produce  the  architecture  shown  in  Fig.  1  in  less 
than  two  hours.  However,  creating  the  programs  for  simu¬ 
lating  the  algorithms  is  much  more  difficult  and  exacting. 
Minor  errors  can  cause  SIMULATE  to  cycle  indefinitely 
and  force  the  user  to  abort  the  simulation.  Future  plans 
call  for  a  compiler  for  creating  simulation  programs  that 
provides  graphical  assistance  in  determining  the  memory- 
to-memory  pipelines.  Such  a  program  would  move  much  of 
the  responsibility  for  details  from  the  user  to  the  compiler 
and  greatly  reduce  the  chance  for  errors. 
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ABSTRACT 

Implementation  of  a  novel  modularly  configured  attached  processor  (MCAP)  architecture 
was  evaluated  using  1  CMOS  logic  on  an  MCM-D.  The  transistor  count  was  approximately 
nine  million  transistors,  distributed  on  tw'enty-five  chip  dies.  Delay,  area,  and  power  calculations 
w'ere  performed  using  the  SUSPENS  model.  Rent's  rule  w^as  found  to  be  not  applicable.  Speed 
w^as  calculated  to  be  in  the  150  MFLOPS  range.  The  module  foot  print  was  calculated  as  90  cm^. 
Power  dissipation  per  unit  area  was  low*  enough  to  allow  air  cooling. 

Index  Terms:  CMOS;  Chip  Design;  Delay;  Powar  Density;  Multichip  Module  (MCM);  At¬ 
tached  Processors. 
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1  Introduction 


An  attached  processor  is  a  processing  system  that  is  connected  to  a  host  computer  for  the 
purpose  of  very  quickly  executing  most  of  the  overall  system  s  computational  tasks.  In  such 
an  organization, “the  host  is  a  program  manager  which  handles  all  I/O,  code  compiling,  and 
operating  system  functions,  while  the  attached  processor  concentrates  on  arithmetic  computation 
with  data  supplied  by  the  host”  [Ij. 

Typical  early  attached  processors  were  the  AP-120B  and  FPS-164  made  by  Floating  Point 
Systems,  Inc.,  the  IBM  3838,  and  the  MATP  made  by  Datawest,  Inc.  [1],  [2],  (3).  These  attached 
processors  aU  have  their  own  data  memories  and  transfer  data  between  these  memories  and 
the  main  memories  of  their  hosts  using  DMA  data  channels.  They  also  include  their  own  code 
memories  where  subprograms  may  be  permanently  stored  or  downloaded  from  their  hosts.  These 
subprograms  are  initiated  by  commands  from  the  host  and  supervise  the  data  flows  from  the 
attached  processor’s  data  memories,  through  the  attached  processor’s  processing  elements,  and 
back  into  the  data  memories. 

In  addition  to  quick  execution,  it  is  also  desirable  to  execute  as  broad  a  set  of  algorithms 
as  possible  in  order  to  create  a  more  generally  applicable  processor.  Thus,  the  underlying  goal 
of  the  designer  is  to  efficiently  utilize  the  hardware  for  as  broad  a  set  of  algorithms  as  possible. 
However,  for  most  current  designs,  the  average  sustainable  execution  rates  have  been  found 
to  be  only  -5%  to  20%  of  their  peak  rates,  which  are  determined  by  summing  the  maximum 
computational  rates  of  the  processing  elements.  For  example,  the  sustainable  rates  for  the  Cray 
X-MP  with  four  processors  may  be  as  low  as  5%  for  some  algorithms  [4].  Although  some  of 
the  lost  efficiency  is  necessitated  by  the  algorithms,  much  of  it  is  due  to  memory  accessing  and 
contention  for  shared  resources  in  general,  including  internal  buses. 

In  this  paper  we  describe  a  modularly  configured  attached  processor  (MCAP)  architecture 
which  can  attain  quickness  and  high  utilization  through:  (1)  closely  matching  their  architectures 
to  the  set  of  algorithms  they  are  to  execute,  (2)  overlapping  of  processing  and  memory  accessing 
by  using  memory  prefetching,  (3)  minimizing  the  movement  of  data,  (4)  using  a  high-speed 
CMOS  with  one  micron  technology,  and  (5)  having  the  whole  MCAP  on  a  single  MCM-D. 


1 


In  Section  2,  the  MCAP  architecture  [5],  [6]  is  summarized.  Next,  in  Section  3,  we  evaluate 
various  design  options  for  implementing  this  MCAP  architecture  in  CMOS  logic. 


2  MCAP  Organization 

An  MCAP  is  an  attached  processor  that  is  constructed  entirely  from  a  standard  set  of 
connections  and  components  (5].  TWs  standard  set  consists  of  three  types  of  asynchronous 
connections  and  twelve  type  of  components.  These  component  types  are  such  that  each  member 
of  the  class  may  include  paraUel  processing,  memory  to  memory  pipelines,  and  be  constructed 
in  a  building  block  fashion.  They  encompass  routing  as  weU  as  memory,  control  and  processing 
components.  By  overlapping  processing  with  memory  accessing  and  matching  an  architecture 
with  a  set  of  algorithms,  it  is  predicted  that  the  average  sustainable  rate  for  a  specific  set  of 
algorithms  can  attain  at  least  60%  of  the  peak  rate.  The  definitions  of  the  connections  and 
component  types  provide  a  standard  set  of  rules.  These  rules  aUow  the  components  to  be 
easily  configured  in  different  ways,  thereby  aUowing  the  construction  of  attached  processors  that 
efficiently  perform  different  sets  of  algorithms. 

Much  of  the  MCAP’s  efficiency  is  gained  by  distributing  the  next  instructions  to  the  various 
components  while  the  current  instruction  is  being  executed.  Once  the  algorithm  begins,  these 
instructions  dictate  the  modes,  routing  patterns,  prefetching  patterns,  etc.  of  the  components 
receiving  them.  Once  an  algorithm  starts,  each  component  operates  more  or  less  on  its  own, 
except  for  responding  to  its  handshaking  signals.  Efficiency  is  further  enhanced  by  prefetching 
operands  from  the  memory  subsystem. 

An  MCAP  operates  by  drawing  an  instruction  stream  from  the  memory  component  into 
the  instruction  component.  The  instruction  component  uses  internal  instructions  in  the  stream 
to  form  external  instructions  that  are  then  distributed  to  the  other  non-memory  components 
through  the  MC.^P’s  bus  component.  All  components  in  the  instruction  stream  include  input 
instruction  queues.  When  the  non-memory  components  have  received  all  of  the  instructions 
needed  to  perform  an  algorithm,  they  automatically  prefetch  the  data  from  the  memory  compo¬ 
nents,  route  the  data  to  and  from  the  processor  components  and  store  the  results  back  into  the 
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memory  components.  All  non-memory  components  have  input  data  queues.  Some  controller 
components,  which  are  the  components  that  supervise  all  memory  accessing,  are  used  to  auto¬ 
matically  transfer  data  between  the  host’s  main  memory  and  the  MCAP’s  memory  components. 
The  instruction  and  data  streams  are  separate,  thereby  allowing  the  instructions  needed  for  the 
next  algorithm  to  be  distributed  while  the  current  algorithm  is  executing. 

The  three  types  of  connections  are  referred  to  as  memory,  instruction,  and  data  connections. 
Instruction  transfers  are  made  through  instruction  connections.  Memory  components  are  con¬ 
nected  to  their  controller  components  using  memory  connections  and  all  other  data  transfers 
are  made  by  means  of  data  connections.  All  connections  are  asynchronous  and,  therefore,  must 
include  handshaking  lines  as  well  as  data  and,  perhaps,  address  lines. 

The  twelve  types  of  components  are  divided  into  six  categories:  (1)  Instruction,  (2)  Bus, 
(3)  Memory,  (4)  Processor,  (5)  Router,  and  (6)  Controller.  The  processor  components  can 
be  subdivided  into  elementary  processors  (one  input,  one  output),  two-input  processors,  and 
comparators.  The  router  components  are  subdivided  into  joins  (multiple  inputs,  one  output), 
forks  (one  input,  multiple  outputs),  and  links  (multiple  inputs,  multiple  outputs).  The  controOer 
components  consist  of  R.\M  controUers,  single  access  controUers,  and  dual  access  controllers. 
The  dual  access  controllers  connect  to  main  memory. 

An  example  architecture  is  given  in  Figure  1.  Its  processing  subsection  includes  a  comparator 
(C-component),  a  negator  (elementary  component),  a  reciprocator  (elementary  component),  a 
set  of  four  pipelined  adders  capable  of  accumulation  (via  feedback),  and  a  set  of  four  pipelined 
multipliers.  Each  adder  and  multiplier  is  constructed  of  four  stages  (a  two-input  component 
followed  by  three  elementary  components).  All  communications  to  and  from  the  processing 
components  are  through  six  link  components,  three  on  each  side  of  the  processor.  Join  and  fork 
components  are  provided  to  allow  flexible  use  of  the  link  components.  There  is  a  dual  access 
component  to  provide  intermediate  memory  and  a  connection  to  main  memory.  The  single 
access  component  provides  internal  storage. 

In  order  to  efficiently  use  the  available  logic  and  interconnections,  an  architecture  must  be 
carefully  matched  to  an  algorithm  or  a  set  of  algorithms.  This  involves  a  study  relating  the  flows. 
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storage  and  processing  of  the  data  required  by  the  algorithm(s).  Clearly,  there  is  no  point  in 
increasing  the  speed  of  a  processing  subsystem  if  the  current  interconnection  delays  and  memory 
hierarchy  are  inadequate  to  support  the  processing  (or  vice  versa).  But  a  good  balance  for  one 
algorithm  may  not  be  equally  good  for  a  different  algorithm.  What  is  needed  is  a  satisfactory 
tradeoff  for  the  work  mix  expected  of  a  system  and  a  moans  of  evaluating  the  design  parameters 
chosen. 

Next  we  present  the  design  considerations  in  implementing  the  architecture  of  Figure  1  in 
CMOS  logic. 

3  Design  Considerations  and  Results 

In  our  evaluation,  CMOS  was  picked  as  the  benchmark  logic  technology  because  of  its  com¬ 
mercial  maturity.  In  the  future,  we  plan  to  evaluate  other  faster  technologies  such  as  GaAs, 
BiCMOS,  and  ECL.  As  for  the  interconnection  and  packaging,  Deposited  Interconnect  Mul¬ 
tichip  Module  technology  (MCM-D)  was  picked  as  the  preferable  vehicle  for  implementation. 
Since  the  signal  delays  associated  with  the  PCB  implementation  are  expected  to  be  prohibitively 
excessive,  it  was  decided  that  the  fabrication  of  an  MCAP  with  MCM  or  Wafer  Scale  Integration 
are  the  only  realistic  alternatives  for  attaining  high  performance.  Further,  multichip  modules 
are  classified  according  to  the  substrate  technology:  MCM-Ceramic,  MCM-Deposition,  and 
MCM-Laminated.  In  the  physical  design,  we  must  face  the  traditional  problems  in  placement 
and  routing  required  by  the  high  performance  systems.  As  the  clock  frequency  is  increased, 
we  need  to  account  for  transmission  line  effects  due  to  long  interconnections.  Parasitics  on  the 
interconnects,  inductances  on  the  power  lines,  and  the  I/O  pin  limitation  3tfe  the  three  vital 
shortcomings  of  current  packaging  technologies,  which  could  be  tackled  by  (a)  minimum  chip 
to  chip  interconnections,  (b)  high  interconnection  density,  and  (c)  parallel  architecture.  Other 
factors  which  need  to  be  considered  are  ground  and  power  plane  generation  and  physical  de¬ 
sign  verification.  The  thermad  considerations  are  a  direct  result  of  the  substrate  type,  bonding 
selection,  and  the  placement  of  chip  dice. 

A  system  level  model,  referred  to  as  the  SUSPENS  model  (Stanford  University  System 
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PErformaNce  Simulator)  [7]  is  used  to  predict  the  performance  of  the  MCAP.  This  mode! 
emphasizes  the  interactions  among  devices,  circuits,  logic,  packaging,  and  architecture.  The 
same  model  could  be  used  to  compare  logic  technologies  (e.g.  CMOS,  Bipolar,  and  GaAs)  and 
various  packaging  technologies  (e.g.  MCM-D,  MCM-C,  PCB,  and  WSI). 

3.1  Chip  Level  Design 

3.1.1  Transistor  Count 

To  illustrate  the  method  used  to  estimate  the  total  number  of  transistors  in  the  example 
MCAP  (Fig.  1),  we  will  consider  one  of  the  floating  point  adders.  Each  adder  has  four  pipelined 
stages  and  uses  the  IEEE  double  precision  standard.  Farther  this  adder  could  be  broken  down 
into:  (a)  nine  64-bit  registers  with  4032  transistors,  (b)  seventy-four  2-input  XOR  gates  with 
.592  transistors,  (c)  one  hundred  and  twenty-six  2-to-l  MUXs  with  504  transistors,  (d)  two  11- 
bit  adders  with  528  transistors,  (e)  one  52-bit  adder  with  1248  transistors,  (f)  a  64-bit  leading 
zero  detector  with  5000  transistors,  (g)  tw'o  52-bit  barrel  shifters  with  4000  transistors,  and  (h) 
rounding  and  other  control  logic  taking  6-500  transistors. 

Thus  the  total  number  of  transistors  for  the  above  adder  is  approximately  23K.  Similarly,  the 
transistor  count  for  the  pipelined  64-bit  floating  point  multiplier  is  approximately  58K  (based 
on  a  modified  Booth's  algorithm).  Likewise,  the  transistor  count  for  the  other  elements  in 
the  MCAP  were  calculated  and  the  results  are  presented  in  table  1.  The  resulting  number  of 
transistors  for  the  whole  MCAP  is  approximately  nine  million. 

3.1.2  Output  Driver  Design 

In  the  proposed  MCAP  architecture,  the  bottle  neck  is  the  communication  through  link, 
single- access,  and  duaJ-access  components  because  of  their  high  fanout  and  large  interconnection 
lengths.  This  means  that  the  output  buffers  for  these  elements  must  be  relatively  large.  We 
present  the  delay,  area,  and  power  dissipation  calculations  for  the  buffers  as  functions  of  fanout 
(F)  and  interconnection  length  (f). 
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For  the  chip  level  model,  we  have  assumed  the  following; 


1.  The  input  capacitance  of  a  gate  (including  the  lead  and  ESD  capacitances)  is  C,„  =  1  pF. 

2.  The  width  of  the  metal  conductor  used  for  an  interconnection  is  W,nt  =  2.0  /xm. 

3.  The  capacitance  of  the  interconnections  is  C,„j  =  2.0  pF/cm. 

4.  The  resistance  of  the  interconnections  is  Hint  =  375  fl/cm. 

5.  The  feature  size  is  Lefj  =  1 

AdditionaJ  parameters  can  be  found  in  Table  2. 

Average  Delay 

In  general,  the  minimum  size  of  a  logic  gate  has  a  W/L  ratio  of  2.  Therefore,  we  began  with 
a  ratio  of  2  and,  by  stages,  moved  to  higher  values  in  order  to  drive  a  load  in  a  small  amount  of 
time.  By  dividing  the  buffering  stages  into  the  number  of  buffers  with  increasing  W/L,  optimum 
speeds  can  be  achieved.  It  has  been  found  that  a  stage  ratio  of  e  [8]  gives  best  results.  We  have 
used  a  stage  ratio  of  3  for  simplicity.  The  optimum  number  of  stages  (N)  is  dependent  on  the 
load  capacitance  (C/).  The  relationship  is 

N  =  0.91(lnC/  +  4.19) 

where  S  is  truncated  (rounded  down)  to  the  nearest  integer. 

Using  the  optimum  number  of  stages,  the  average  delay  is 

=  0.484(iV  -  1)  +  5C,/3(A  -  1)  +  0.076  ns 

Delay  calculations  are  shown  in  Fig.  2. 

Buffer  Area 

A  simple  inverter  with  (W/£<)n  =  iW/L)p  =  2  will  need  an  area  of  66  A  buffer  with 

equal  rise  {tr)  and  fall  (t/)  times  requires  {W/L)^  =  2(W/L)n  =  4  and  the  area  is  going  to  be 
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101  The  total  area  of  the  buffer  depends  on  the  number  of  stages  and,  hence,  is  a  function 
of  F  and  1.  We  have 

Area  =  66  +  3[14(.V  -  1)  +  36(1  +  3  +  3^  +  •  •  •  +  3‘'^'^)]  a  55  x  3^'''"'  ^lTn^ 

Area  calculations  are  plotted  in  Fig.  3. 

Power  Dissipation  In  The  Buffer 

In  CMOS,  most  of  the  power  is  dissipated  during  switching  and,  hence,  dynamic  power  is 
approximately  equaJ  to  the  total  power.  The  dynamic  power  is 

Pi  =  Ct  X  v’  X  /avj  =  3.3^(C/  +  Cbufj)ITa^ 

where  Ci,,//  =  0.0152(3''^-MpF. 

Power  dissipation  calculations  are  presented  in  Fig.  4. 

Since  the  design  of  an  MCAP  uses  asynchronous  communication,  the  transfers  over  a  link 
component  involve  the  return  of  an  acknowledge  signal  and  the  transmission  of  an  output  enable 
signal.  It  is  estimated  that  the  transfer  rate  may  be  as  high  as  /  =  l/2[To„j  +  Tchip]  Hz  (where 
Tj.jj  is  the  delay  on  the  interconnection  and  T^^.p  is  the  chip  delay). 

Load  Capacitance 

For  the  load  capacitance 

Cl  =  Cinttot  F  F  X  C,n, 

with 

C.nt.o.  =  Ci„t  X  /  =  f  pF 

where  t  is  in  cm  and  C,„(  =  1  pF/cm.  Therefore, 

C;  =  (^  +  F)  pF. 

The  resistance  of  the  interconnect  is 

~  ^nt  Xt—  3  A  Xtil  . 
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3.1.3  Modified  SUSPENS  Model 


Given  (a)  the  approximate  number  of  logic  gates,  (b)  the  transistor  technology  parameters, 
(c)  packaging  technology  parameters,  and  (d)  the  number  of  pads  per  chip  (estimated  from  Rent's 
rule),  the  SUSPENS  model  [7]  can  estimate  system  performance.  Rent's  rule  is  an  empirical 
result  obtained  by  observing  existing  designs.  The  design  philosophy  and  methodology  affect 
Rent’s  constants.  If  the  predictions  are  made  for  a  system  with  an  entirely  different  design 
philosophy  from  the  one  from  which  Rent’s  data  were  obtained,  the  results  will  have  little 
meaning.  The  SUSPENS  model,  as  originally  proposed,  used  Rent’s  constants,  therefore  we 
have  developed  constants  that  are  applicable  to  the  novel  architecture  of  the  MCAP. 

Our  approach  to  developing  these  constants  was  to  determine  the  area  needed  for  an  inverter 
(with  equal  rise  and  fall  times)  and  a  carry  generator  circuit.  From  this  we  computed  the  average 
area  per  transistor.  Thirty  percent  of  that  area  was  assumed  to  be  taken  by  the  interconnections, 
resulting  in  a  figure  of  50.7  /im^  per  transistor.  This  result  was  used  as  input  to  the  SUSPENS 
model  for  computations  done  at  the  chip  level.  The  I/O  buffer  areas  were  estimated  separately 
as  presented  in  section  3.1.2. 

To  illustrate  the  use  of  the  SUSPENS  model  (using  table  2),  we  will  consider  the  adder  chip 
(see  Fig.  5).  The  adder  chip  contains  four  64-bit  floating  point  adders.  The  input  stage  delay  is 

T.  = 

/j  (number  of  n  -  transistors  in  series)  =  3 
Ki  (W/L  ratio  of  input  transistors)  =  i 
Kg  (W/L  ratio  of  output  buffer)  =  4 
CtT  (for  I  nm  CMOS)  =  3  fF 
Rtr  (for  1  liva  CMOS)  =  15  K 
T.  =  0.41  ns 

The  output  stage  delay  is 
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^av  (average  intercoanection  length) 

C.n,  =  2  pF/cm 
=  375  fi/cm 

The  total  gate  delay,  T,  =  T,-  +  To,  is  0.84  ns.  The  delay  for  an  adder  is 

TcHip  =  fuT,  +  Ri^tCUDlm  +  {Dolvc) 
restrict  the  logic  depth,  fwi  to  6 
Vc  =  2.5  X  10^^  cm/sec 
Dc  =  0.3  cm 

Tchip  =  6(0.84  X  10"®)  +  375(2  x  10''®)(0.3V2)  +  (0.3/2.5  x  10^^) 

=  5.07  ns  (which  includes  the  latch  time,  logic  time,  setup  time,  and  clock  skew) 

Thus  the  msodmum  frequency  of  the  adder  chip  is  197  MHz.  By  incorporating  pipelining  and 
recadling  that  the  total  chip  area  actually  has  four  of  these  floating  point  64-bit  adders,  the 
throughput  is  improved  by  a  factor  of  more  than  four. 

The  power  dissipation  for  the  adder  chip  is  found  as  follows.  The  external  capacitance  of  a 
gate  is 

Ccxt  —  fg^avgCint  +  fg^i^tr  ~  36.5  fF, 
where  =  81.3  x  10"®  cm.  The  internal  capacitance  of  a  gate  is 

C,ni  =  ^KoCxr  -t-  5Ct.  =  51  fF 

The  total  capaicitance  per  logic  gate,  Cg  =  Cat  +  Cmii  is  87.5  fF. 

The  dynamic  power  dissipation  per  gate  is  based  on  the  maximum  adder  chip  operating 
frequency  (A)  and  the  percentage  of  gates  that  switch  during  a  clock  period  (/i).  The  dynamic 
power  dissipation  is 

Pg  =  \fcUCgVlD 

=  i(131  X  10®)(0.3)(83.5  X  10-'®)(3.3)® 

=  56.3 /iW 
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The  power  dissipation  of  the  chip  is  the  product  of  the  number  of  gates  ( and  the  dynamic 
power  dissipation  per  gate  (P,)- 

Pc  —  ■  Pg 

=  (l76ib/4)-(56.3x  10'®) 

=  1.65W 

Thus  the  power  density  for  the  adder  chip  (area  =  0.09  cm^)  is  18.4  W/cm^. 

3.2  Interconnection  and  Packaging  Considerations 

Once  the  results  for  each  chip  have  been  obtained  (see  Table  4),  the  package  level  model  is 
incorporated  using  thin  film  hybrid  parameters  (Table  3).  The  average  interconnection  length 
at  the  module  level  (in  units  of  chip  footprint  size)  is  [7] 

^  = 


Pm  ~ 

To  find  the  number  of  interconnections  for  the  adder  chip,  we  assume  that  the  output  buffers 
have  a  fanout  of  3  and  a  critical  length  of  1.5  cm.  Thus,  from  Fig.  3,  we  see  that  the  buffer 
delay  is  2.45  ns.  Further,  the  capacitive  delay  (caused  by  the  loading  of  the  I/O  pins  and  contact 
pads)  and  the  time  of  flight  delay  on  the  transmission  lines  are  both  calculated  using 

P additional  ~  2  ‘  '  (^pad  "I"  ^lintl^m  —  0.124  nS 

.\dding  in  Tchxp  -  5  07  ns,  we  deternoine  that  the  total  delay  of  the  adder  chip  =  5.07  ns  +  2.45 
ns  +  0.124  ns  =  7.64  ns.  Thus,  the  adder  chip  can  output  at  the  rate  of  131  MHz. 

The  area  required  by  the  output  buffers  are  added  to  the  transistor  area  to  get  the  die  area. 
Assuming  an  area  distributed  solder  bumps  with  100  /im  diameter  and  2S0  /im  pitch.  The 
footprint  of  the  Adder  chip  die  is  found  to  be  0.64  cm. 


2  *  -  1  1  -  1  _4T)-i 

4n-o.5  _  1  1  _  4.1-0.75  J  j  _  ^JJ-l 


A'  =  4 


T)  -  0.65 
1.33 
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The  results  from  repeating  tills  process  for  the  other  chips  are  presented  in  Table  4  (Note 
that  all  of  the  chip  areas  have  been  optimized  for  a  power  density  of  no  more  than  15  W/cm^). 
Hence,  we  determine  that  the  module  frequency  is  approximately  150  MFLOPS  (considering 
the  processing  and  driving  involved  in  one  cycle).  The  module  size  is  9.0  cm  x  10.0  cm.  Module 
power  dissipation  (Pn,)  is  determined  by 
P  I  —  5^ 

Cm  —  - +  R^FpCint) 

I  +  Fc  1-0 

=  -25(600)[3l^3  X  10-*"  +  2(0.25  x  10'*^)  +  1.33(0.64)(1  x  lO’*^)] 

5  1—5 

=  0.1  mF 

Pm  =  \{FD){fs){Cm){VlD) 

=  i(0.5)(100  x  10®)(0.1  x  10-®)(3.3^) 

=  27.4  W 

The  actual  power  dissipation  is  the  greater  of  Pm  and  the  sum  of  the  power  dissipated  at  all  the 
chip  dice.  Thus  we  report  the  power  dissipation  for  the  module  to  be  68.82  W. 

4  Conclusions 


Design  evaluation  for  implementing  a  novel  modularly  configured  attached  processor  archi¬ 
tecture  using  1  /rm  CMOS  logic  on  an  MCM-D  (see  Fig.  6)  revealed  that  approximately  nine 
million  transistors  wiU  be  needed.  These  could  be  placed  on  a  set  of  twenty  five  chip  dies.  Delay, 
area,  and  power  calculations  were  done  with  the  SUSPENS  model  (however,  Rent’s  rule  was 
not  used).  Delay  calculation  (including  logic  delay,  interconnect  delay  and  output  driver  delay) 
showed  that  the  MCAP  module,  on  average,  would  achieve  speeds  in  the  150  MFLOPS  range. 
The  single-access  memory  controOer  component  chip  (S-control)  was  found  to  be  the  slowest 
(110  MHz).  Higher  speeds  would  be  achievable  with  faster  logic  ’like  BiCMOS,  ECL,  and  GaAs. 
In  fact,  further  calculations  using  0.5  /rm  CMOS  and  0.25  /im  CMOS  logic  show  that  average 
speeds  of  the  S-control  chip  increase  from  110  MHz  (for  1  /im  design  rule)  to  160  MHz  (for 
0.5  design  rule)  and  220  MHz  (for  0.25  urn  design  rule). 
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Power  dissipation  calculation  showed  that  approximately  70  watts  will  be  dissipated  in  the 
MCAP  module  and  air  cooling  would  suffice  for  the  1  ^lm  CMOS  design  rule. 

We  are  in  the  process  of  performing  further  design  calculations  involving  wafer-scale  inte¬ 
gration  (WSI)  and  GaAs  technology. 
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Captions  to  illustrations 


1.  Fig.  1.  An  example  MCAP  architecture. 

2.  Fig.  2.  Buffer  and  interconnect  delay. 

3.  Fig.  3.  Buffer  area. 

4.  Fig.  4.  Power  dissipated  in  the  Buffer. 

5.  Fig.  5.  Block  diagram  of  the  Adder  chip. 
Fig.  6.  Layout  of  the  MCAP  on  an  MCM. 


6. 


ELEMENT 

DESCRIPTION 

#  TRANSISTORS 

Memory  Element 

has  4K  each  of  RAM  and  ROM 

1.84M 

Instruction 

has  8  words  of  FIFO 

10.0  K 

Bus 

has  8  words  of  FIFO 

10.0  K 

Elementaxy 

has  8  words  of  FIFO 

10.0  K 

Two  input 

has  8  words  of  FIFO 

12.0  K 

Join 

3  inputs  arid  8  FIFO 

14.0  K 

Fork 

3  outputs  and  8  FIFO 

07.0  K 

Link 

4  inputs  and  5  outputs 

15.0  K 

Static  Ram 

16  elements  of  1  K  each 

06.3  M 

Single  Access  Controller 

Controls  8  memory  elements 

ll.OK 

Dual  Access  Controller 

Controls  8  memory  elements  and  6 
DMA  channels 

71.0K 

Compare 

sends  out  Flags  and  Indices 

25.0  K 

Reciprocate 

using  Convergence  method 

50.0  K 

Negate 

invert  the  sign  bit 

01.0  K 

Floating  Point  Adder 

using  CLAs,  Barrel  shifters  etc.. 

23.0  K 

Floating  Point  Multiplier 

using  modified  Booth's  algorithm 

61.0  K 

MCAP 

has  12  t3rpes  of  components 

9Mimon 

MCM 

with  25  chips  and  600  I/Os 

9Mmion 

Table.  1  Transistor  count  for  the  various  MCAP  components 


Parameter 

CMOS 

1.0 

t„(A) 

250 

V«(v) 

3.3 

Rv(ohm) 

15  K 

Q,(fF) 

3.0 

Wi.  (m) 

2.0 

W.Oi) 

2.0 

H^Oi) 

0.4 

P,(^) 

4.0 

n* 

3 

(ohm/cm) 

375 

CjH  (pF/cm) 

2.0 

Table  2.  Technology  parameters 
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Parameter 

MCM-D 

P,(m) 

50 

N, 

2 

W^(m) 

25 

w.Oi) 

25 

2.0 

(ohm/cm) 

3.4 

Did.  const. 

3.4 

V„  (cm/nS) 

16 

Cia  (pF/cm) 

1.0 

Zo  (ohm) 

60 

(PF) 

0.25 

Po(m) 

100 

Table  3.  Thin  film  hybrid  parameters 
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1  Parameters  - 

#  Transistors 

#  of  L^O* 

Latency 

(nS) 

Power  Diss. 
(W) 

Area 

(cm^) 

Instn/Bus 

21.2  k 

200 

7.40 

1.65 

0.14 

Link 

16.2  k 

200 

8.16 

1.27 

0.14 

S-control 

32.7  k 

222 

8.76 

0.70 

0.14 

D-cootrol 

85.4  k 

846 

8.34 

3.67 

0.56 

R0M4k 

262.6  k  j 

80 

9.07 

-  - - — ^ 

1.66 

0.11 

SRAM4k 

1.57  M 

150 

9.07 

8.66 

0.58 

SRAM- Ik 

393.6  k 

150 

9.07 

2.36 

0.16 

330  k 

660 

8.17 

3.44 

0.41 

120  k 

874 

7.19 

2.54 

0.55 

178  k 

660 

7.61 

2.75 

0.41 

1  MODULE 

9  Million 

600 

1 

68.8 

90 

Table  4.  Various  parameters  of  the  MCAP  on  an 
MCM-D  using  l/i  CMOS  process 

(optimized  for  Q  s  15  W/cm*) 
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